amdgpu: set an error on all fences from a bad context

Alex Deucher Mon, 15 Sep 2025 10:20:11 -0700

On Mon, Sep 15, 2025 at 10:32 AM Christian König
<christian.koe...@amd.com> wrote:
>
> On 15.09.25 16:24, Alex Deucher wrote:
> > On Mon, Sep 15, 2025 at 9:59 AM Christian König
> > <christian.koe...@amd.com> wrote:
> >>
> >>
> >>
> >> On 15.09.25 15:33, Alex Deucher wrote:
> >>> When we backup ring contents to reemit after a queue reset,
> >>> we don't backup ring contents from the bad context.  When
> >>> we signal the fences, we should set an error on those
> >>> fences as well.
> >>>
> >>> v2: misc cleanups
> >>> v3: add locking for fence error, fix comment (Christian)
> >>>
> >>> Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a 
> >>> fence")
> >>> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> >>> ---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 33 ++++++++++++++++++++---
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  |  2 +-
> >>>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  2 +-
> >>>  3 files changed, 31 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> index fd8cca241da62..72f0f16924476 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> >>> @@ -758,11 +758,36 @@ void amdgpu_fence_driver_force_completion(struct 
> >>> amdgpu_ring *ring)
> >>>   * @fence: fence of the ring to signal
> >>>   *
> >>>   */
> >>> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence 
> >>> *fence)
> >>> +void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af)
> >>>  {
> >>> -     dma_fence_set_error(&fence->base, -ETIME);
> >>> -     amdgpu_fence_write(fence->ring, fence->seq);
> >>> -     amdgpu_fence_process(fence->ring);
> >>> +     struct dma_fence *unprocessed;
> >>> +     struct dma_fence __rcu **ptr;
> >>> +     struct amdgpu_fence *fence;
> >>> +     struct amdgpu_ring *ring = af->ring;
> >>> +     unsigned long flags;
> >>> +     u64 i, seqno;
> >>> +
> >>> +     seqno = amdgpu_fence_read(ring);
> >>> +
> >>> +     spin_lock_irqsave(&ring->fence_drv.lock, flags);
> >>> +     for (i = seqno + 1; i <= ring->fence_drv.sync_seq; ++i) {
> >>
> >> That won't work, the seqno can wrap around.
> >>
> >> I would just go over all fences, e.g. from 0 to end of array.
> >>
> >> The checks below should make sure that we don't try to set an error on 
> >> something already processed.
> >>
> >>> +             ptr = &ring->fence_drv.fences[i & 
> >>> ring->fence_drv.num_fences_mask];
> >>> +             rcu_read_lock();
> >>> +             unprocessed = rcu_dereference(*ptr);
> >>> +
> >>> +             if (unprocessed && !dma_fence_is_signaled(unprocessed)) {
> >>
> >> dma_fence_is_signaled_locked(), otherwise the function would try to lock 
> >> the same spinlock again.
> >>
> >>> +                     fence = container_of(unprocessed, struct 
> >>> amdgpu_fence, base);
> >>> +
> >>> +                     if (fence == af)
> >>> +                             dma_fence_set_error(&fence->base, -ETIME);
> >>> +                     else if (fence->context == af->context)
> >>> +                             dma_fence_set_error(&fence->base, 
> >>> -ECANCELED);
> >>> +             }
> >>> +             rcu_read_unlock();
> >>> +     }
> >>> +     spin_unlock_irqrestore(&ring->fence_drv.lock, flags);
> >>
> >>> +     amdgpu_fence_write(ring, af->seq);
> >>> +     amdgpu_fence_process(ring);
> >>
> >> That's actually wrong. We want the other fences to signal later on when we 
> >> process them.
> >
> > This fence is the original guilty fence.  The other fences will
> > naturally signal when the later fences signal.
>
> Ah, got it. But that is still racy as hell.
>
> We should probably rather signal the guilty fence separately by calling 
> dma_fence_signal().


What is it racing with?  amdgpu_fence_process() ultimately calls
dma_fence_signal(), plus it handles various AMD specific bookkeeping.

Alex

>
> Regards,
> Christian.
>
> >
> > Alex
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>  }
> >>>
> >>>  void amdgpu_fence_save_wptr(struct dma_fence *fence)
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index 6379bb25bf5ce..f6c9decedbd51 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -812,7 +812,7 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring 
> >>> *ring,
> >>>       if (r)
> >>>               return r;
> >>>
> >>> -     /* signal the fence of the bad job */
> >>> +     /* signal the guilty fence and set an error on all fences from the 
> >>> context */
> >>>       if (guilty_fence)
> >>>               amdgpu_fence_driver_guilty_force_completion(guilty_fence);
> >>>       /* Re-emit the non-guilty commands */
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index 7670f5d82b9e4..0704fd9b85fdb 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -155,7 +155,7 @@ extern const struct drm_sched_backend_ops 
> >>> amdgpu_sched_ops;
> >>>  void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
> >>>  void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
> >>>  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
> >>> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence 
> >>> *fence);
> >>> +void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence 
> >>> *af);
> >>>  void amdgpu_fence_save_wptr(struct dma_fence *fence);
> >>>
> >>>  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);
> >>
>

Re: [PATCH 1/2] drm/amdgpu: set an error on all fences from a bad context

Reply via email to