Hello Luben,

On Wed, 21 Jun 2023 09:56:40 -0400
Luben Tuikov <[email protected]> wrote:

> On 2023-06-19 03:19, Boris Brezillon wrote:
> > The drm_sched_entity_kill_jobs_cb() logic omits the last fence popped
> > from the dependency array, the one that was waited upon before
> > drm_sched_entity_kill() was called (the drm_sched_entity::dependency
> > field), so we're basically waiting for all dependencies except one.
> > 
> > In theory, this wait shouldn't be needed because resources should have
> > their users registered to the dma_resv object, thus guaranteeing that
> > future jobs wanting to access these resources wait on all the previous
> > users (depending on the access type, of course). But we want to keep
> > these explicit waits in the kill entity path just in case.
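
Side note on where those implicit waits come from: on the submit side,
drivers typically pull the fences out of each BO's dma_resv and stash
them in job->dependencies. A rough sketch, assuming GEM BOs and the
generic helper; bo_count, bos[] and writes[] are made-up names for this
example, not something from this patch:

	int i, ret;

	for (i = 0; i < bo_count; i++) {
		/* bo_count/bos/writes are placeholders for whatever the
		 * driver's submit ioctl hands over.
		 *
		 * This collects the fences already registered in
		 * bos[i]->resv into job->dependencies, so the job waits
		 * on the previous users of the buffer: all of them for a
		 * write access, writers only for a read.
		 */
		ret = drm_sched_job_add_implicit_dependencies(job, bos[i],
							      writes[i]);
		if (ret)
			return ret;
	}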
> > 
> > Let's make sure we keep all dependencies in the array in
> > drm_sched_job_dependency(), so we can iterate over the array and wait
> > in drm_sched_entity_kill_jobs_cb().
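
Keeping the entries in the xarray doesn't leak anything, by the way: the
reference the array holds on each fence is still released at job cleanup
time, and AFAICT the extra reference taken with dma_fence_get() below is
consumed by the caller of drm_sched_job_dependency() just like the
xa_erase()'d one was before. drm_sched_job_cleanup() already does
something along these lines (paraphrased, not a verbatim copy):

	unsigned long index;
	struct dma_fence *fence;

	/* Drop the reference each remaining dependency entry still holds,
	 * then free the xarray itself.
	 */
	xa_for_each(&job->dependencies, index, fence)
		dma_fence_put(fence);
	xa_destroy(&job->dependencies);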
> > 
> > We also make sure we wait on drm_sched_fence::finished if we were
> > originally asked to wait on drm_sched_fence::scheduled. In that case,
> > we assume the intent was to delegate the wait to the firmware/GPU or
> > rely on the pipelining done at the entity/scheduler level, but when
> > killing jobs, we really want to wait for completion, not just scheduling.
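
For anyone not familiar with the scheduler fences: both fences live in
the same drm_sched_fence object, which is why we can get from one to the
other in the kill path. Roughly (other fields trimmed):

	struct drm_sched_fence {
		/* Signaled once the scheduler has picked the job and
		 * handed it to the backend; pipelined submissions may
		 * only wait on this one.
		 */
		struct dma_fence	scheduled;

		/* Signaled once the job actually completed on the
		 * hardware, which is what we want when killing jobs.
		 */
		struct dma_fence	finished;

		/* ... */
	};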
> > 
> > v6:
> > - Back to v4 implementation
> > - Add Christian's R-b
> > 
> > v5:
> > - Flag deps on which we should only wait for the scheduled event
> >   at insertion time
> > 
> > v4:
> > - Fix commit message
> > - Fix a use-after-free bug
> > 
> > v3:
> > - Always wait for drm_sched_fence::finished fences in
> >   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> > 
> > v2:
> > - Don't evict deps in drm_sched_job_dependency()  
> 
> Hmm, why is this in reverse chronological order?
> It's very confusing.

Dunno, that's how I've always ordered things, and a quick look at some
dri-devel patches [1][2] makes me think I'm not the only one who starts
from the latest submission.

[1] https://lkml.org/lkml/2023/6/19/941
[2] https://lore.kernel.org/dri-devel/[email protected]/T/#t

> 
> > 
> > Signed-off-by: Boris Brezillon <[email protected]>
> > Suggested-by: "Christian König" <[email protected]>
> > Reviewed-by: "Christian König" <[email protected]>  
> 
> These three lines would usually come after the CCs.

Again, I think I've always inserted those tags before the Cc, but I can
re-order things if you prefer. Let me know if you want me to send a v7
addressing the Cc+changelog ordering.

Regards,

Boris

> 
> Regards,
> Luben
> 
> > Cc: Frank Binns <[email protected]>
> > Cc: Sarah Walker <[email protected]>
> > Cc: Donald Robson <[email protected]>
> > Cc: Luben Tuikov <[email protected]>
> > Cc: David Airlie <[email protected]>
> > Cc: Daniel Vetter <[email protected]>
> > Cc: Sumit Semwal <[email protected]>
> > Cc: "Christian König" <[email protected]>
> > ---
> >  drivers/gpu/drm/scheduler/sched_entity.c | 41 +++++++++++++++++++-----
> >  1 file changed, 33 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 68e807ae136a..ec41d82d0141 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -176,16 +176,32 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> >  {
> >     struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> >                                              finish_cb);
> > -   int r;
> > +   unsigned long index;
> >  
> >     dma_fence_put(f);
> >  
> >     /* Wait for all dependencies to avoid data corruptions */
> > -   while (!xa_empty(&job->dependencies)) {
> > -           f = xa_erase(&job->dependencies, job->last_dependency++);
> > -           r = dma_fence_add_callback(f, &job->finish_cb,
> > -                                      drm_sched_entity_kill_jobs_cb);
> > -           if (!r)
> > +   xa_for_each(&job->dependencies, index, f) {
> > +           struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> > +
> > +           if (s_fence && f == &s_fence->scheduled) {
> > +           /* The dependencies array had a reference on the scheduled
> > +                    * fence, and the finished fence refcount might have
> > +                    * dropped to zero. Use dma_fence_get_rcu() so we get
> > +                    * a NULL fence in that case.
> > +                    */
> > +                   f = dma_fence_get_rcu(&s_fence->finished);
> > +
> > +                   /* Now that we have a reference on the finished fence,
> > +                    * we can release the reference the dependencies array
> > +                    * had on the scheduled fence.
> > +                    */
> > +                   dma_fence_put(&s_fence->scheduled);
> > +           }
> > +
> > +           xa_erase(&job->dependencies, index);
> > +           if (f && !dma_fence_add_callback(f, &job->finish_cb,
> > +                                            drm_sched_entity_kill_jobs_cb))
> >                     return;
> >  
> >             dma_fence_put(f);
> > @@ -415,8 +431,17 @@ static struct dma_fence *
> >  drm_sched_job_dependency(struct drm_sched_job *job,
> >                      struct drm_sched_entity *entity)
> >  {
> > -   if (!xa_empty(&job->dependencies))
> > -           return xa_erase(&job->dependencies, job->last_dependency++);
> > +   struct dma_fence *f;
> > +
> > +   /* We keep the fence around, so we can iterate over all dependencies
> > +    * in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
> > +    * before killing the job.
> > +    */
> > +   f = xa_load(&job->dependencies, job->last_dependency);
> > +   if (f) {
> > +           job->last_dependency++;
> > +           return dma_fence_get(f);
> > +   }
> >  
> >     if (job->sched->ops->prepare_job)
> >             return job->sched->ops->prepare_job(job, entity);  
> 
