amdgpu: set an error on all fences from a bad context

Christian König Fri, 05 Sep 2025 00:23:05 -0700

On 04.09.25 16:45, Alex Deucher wrote:
> When we backup ring contents to reemit after a queue reset,
> we don't backup ring contents from the bad context.  When
> we signal the fences, we should set an error on those
> fences as well.
> 
> v2: misc cleanups
> 
> Fixes: 77cc0da39c7c ("drm/amdgpu: track ring state associated with a fence")
> Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 28 +++++++++++++++++++----
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c  |  2 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h  |  2 +-
>  3 files changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index fd8cca241da62..1a2710f453551 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -758,11 +758,31 @@ void amdgpu_fence_driver_force_completion(struct 
> amdgpu_ring *ring)
>   * @fence: fence of the ring to signal
>   *
>   */
> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *fence)
> +void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af)
>  {
> -     dma_fence_set_error(&fence->base, -ETIME);
> -     amdgpu_fence_write(fence->ring, fence->seq);
> -     amdgpu_fence_process(fence->ring);
> +     struct dma_fence *unprocessed;
> +     struct dma_fence __rcu **ptr;
> +     struct amdgpu_fence *fence;
> +     struct amdgpu_ring *ring = af->ring;
> +     u64 i, seqno;
> +
> +     seqno = amdgpu_fence_read(ring);
> +
> +     for (i = seqno + 1; i <= ring->fence_drv.sync_seq; ++i) {
> +             ptr = &ring->fence_drv.fences[i & 
> ring->fence_drv.num_fences_mask];
> +             rcu_read_lock();
> +             unprocessed = rcu_dereference(*ptr);
> +
> +             if (unprocessed && !dma_fence_is_signaled(unprocessed)) {
> +                     fence = container_of(unprocessed, struct amdgpu_fence, 
> base);
> +
> +                     if (fence->context == af->context)
> +                             dma_fence_set_error(&fence->base, -ETIME);


This now needs to be protected by the spinlock we use for signaling.

Otherwise it can be that dma_fence_is_signaled() return false, the fence 
signals in between and dma_fence_set_error() throws a warning.

Additional to that I think it would be better to use -TIME on the timed out 
fence and -ECANCELED on all others.

> +             }
> +             rcu_read_unlock();
> +     }
> +     amdgpu_fence_write(ring, af->seq);
> +     amdgpu_fence_process(ring);
>  }
>  
>  void amdgpu_fence_save_wptr(struct dma_fence *fence)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 6379bb25bf5ce..77ea1ef46236d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -812,7 +812,7 @@ int amdgpu_ring_reset_helper_end(struct amdgpu_ring *ring,
>       if (r)
>               return r;
>  
> -     /* signal the fence of the bad job */
> +     /* signal the fences of the bad job */

"Bad context" or even completely drop that comment. Should be obvious what 
happens and why is explained on the function doc IIRC.

>       if (guilty_fence)
>               amdgpu_fence_driver_guilty_force_completion(guilty_fence);
>       /* Re-emit the non-guilty commands */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 7670f5d82b9e4..0704fd9b85fdb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -155,7 +155,7 @@ extern const struct drm_sched_backend_ops 
> amdgpu_sched_ops;
>  void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring);
>  void amdgpu_fence_driver_set_error(struct amdgpu_ring *ring, int error);
>  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
> -void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *fence);
> +void amdgpu_fence_driver_guilty_force_completion(struct amdgpu_fence *af);
>  void amdgpu_fence_save_wptr(struct dma_fence *fence);
>  
>  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring);

Re: [PATCH 1/2] drm/amdgpu: set an error on all fences from a bad context

Reply via email to