On 30/10/2025 at 12:17, Philipp Stanner wrote:
On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out

This link should be moved to the tag section at the bottom as a Closes:
tag. Optionally a Reported-by:, too.

The bug report is about a different issue. The potential deadlock being fixed by this patch was discovered while investigating it.
I'll add a Reported-by: tag, though.


a possible deadlock:

[ 1231.611031]  Possible interrupt unsafe locking scenario:

[ 1231.611033]        CPU0                    CPU1
[ 1231.611034]        ----                    ----
[ 1231.611035]   lock(&xa->xa_lock#17);
[ 1231.611038]                                local_irq_disable();
[ 1231.611039]                                lock(&fence->lock);
[ 1231.611041]                                lock(&xa->xa_lock#17);
[ 1231.611044]   <Interrupt>
[ 1231.611045]     lock(&fence->lock);
[ 1231.611047]
                 *** DEADLOCK ***


The commit message is lacking an explanation as to _how_ and _when_ the
deadlock comes to be. That's a prerequisite for understanding why the
below is the proper fix and solution.

I only copy-pasted a small chunk of the deadlock analysis report included in the ticket because the full report is 300+ lines long. Copying the whole log isn't useful IMO, but I can add more context.

The problem is that a thread (CPU0 above) can take the lock of the job's dependencies xarray (xa_lock) without disabling interrupts. If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb is invoked from interrupt context on that same CPU, the callback will try to grab the xarray lock again, which can never succeed because CPU0 already holds it.
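
To spell this out, here is a minimal sketch of the two paths involved (illustrative names and structures, not the actual scheduler code):

#include <linux/dma-fence.h>
#include <linux/xarray.h>

struct sketch_cb {
	struct dma_fence_cb base;
	struct xarray *deps;		/* plays the role of job->dependencies */
	unsigned long index;
};

/* Process context (CPU0): takes xa_lock with interrupts still enabled. */
static void drop_dependency_sketch(struct xarray *deps, unsigned long index)
{
	xa_erase(deps, index);		/* spin_lock(&deps->xa_lock) */
}

/* Fence callback: dma_fence_signal() may invoke this from hard-IRQ
 * context, with fence->lock held. */
static void kill_jobs_cb_sketch(struct dma_fence *f, struct dma_fence_cb *cb)
{
	struct sketch_cb *s = container_of(cb, struct sketch_cb, base);

	/*
	 * This needs the same xa_lock. If the interrupt fired on the CPU
	 * that is currently inside drop_dependency_sketch(), it spins on
	 * a lock its own CPU already holds: the scenario lockdep reports.
	 */
	xa_erase(s->deps, s->index);
}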



The issue seems to be that you cannot perform certain tasks from within
that work item?

My initial fix was to replace xa_erase by xa_erase_irq, but Christian
pointed out that calling dma_fence_add_callback from a callback can
also deadlock if the signalling fence and the one passed to
dma_fence_add_callback share the same lock.
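
To make that second scenario concrete, a sketch (illustrative container struct, not the scheduler's actual structures): dma_fence_signal() runs callbacks with the signalling fence's lock held, and dma_fence_add_callback() takes the lock of the fence it is given, so if both fences share one spinlock the re-arm below deadlocks even with xa_erase_irq() in place.

#include <linux/dma-fence.h>

struct rearm_sketch {
	struct dma_fence_cb base;
	struct dma_fence *next;		/* next dependency to wait on */
};

/* dma_fence_signal(a) invokes this with a->lock held. */
static void rearm_cb_sketch(struct dma_fence *a, struct dma_fence_cb *cb)
{
	struct rearm_sketch *s = container_of(cb, struct rearm_sketch, base);

	/*
	 * dma_fence_add_callback() takes s->next->lock. Fences from the
	 * same context often share one spinlock, so this can be the very
	 * lock already held around this callback: deadlock, even with
	 * xa_erase_irq() in place.
	 */
	dma_fence_add_callback(s->next, cb, rearm_cb_sketch);
}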

To fix both issues, the code that iterates over the dependencies and
re-arms the callback is moved out to drm_sched_entity_kill_jobs_work.

Suggested-by: Christian König <[email protected]>
Signed-off-by: Pierre-Eric Pelloux-Prayer <[email protected]>
---
  drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
  1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index c8e949f4a568..fe174a4857be 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
  }
  EXPORT_SYMBOL(drm_sched_entity_error);
+static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
+                                         struct dma_fence_cb *cb);
+
  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
  {
        struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
-
-       drm_sched_fence_scheduled(job->s_fence, NULL);
-       drm_sched_fence_finished(job->s_fence, -ESRCH);
-       WARN_ON(job->s_fence->parent);
-       job->sched->ops->free_job(job);

Can free_job() really not be called from within work item context?

It's still called from drm_sched_entity_kill_jobs_work, but the diff is slightly confusing.
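
To make the intent clearer, this is roughly the shape of the work item after the patch (a simplified sketch based on the description above, not the literal hunks):

static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
{
	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
	struct dma_fence *f;
	unsigned long index;

	/* Process context: taking xa_lock and fence->lock is safe here. */
	xa_for_each(&job->dependencies, index, f) {
		xa_erase(&job->dependencies, index);

		/* Re-armed: the callback schedules this work again once
		 * the dependency signals. */
		if (!dma_fence_add_callback(f, &job->finish_cb,
					    drm_sched_entity_kill_jobs_cb))
			return;

		dma_fence_put(f);
	}

	/* No dependencies left: same teardown as before. */
	drm_sched_fence_scheduled(job->s_fence, NULL);
	drm_sched_fence_finished(job->s_fence, -ESRCH);
	WARN_ON(job->s_fence->parent);
	job->sched->ops->free_job(job);
}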

Pierre-Eric




P.
