Re: Lockdep spalt on killing a processes

Andrey Grodzovsky Thu, 28 Oct 2021 10:26:41 -0700


On 2021-10-27 3:58 p.m., Andrey Grodzovsky wrote:

On 2021-10-27 10:50 a.m., Christian König wrote:
Am 27.10.21 um 16:47 schrieb Andrey Grodzovsky:
On 2021-10-27 10:34 a.m., Christian König wrote:
Am 27.10.21 um 16:27 schrieb Andrey Grodzovsky:
[SNIP]
Let me please know if I am still missing some point of yours.
Well, I mean we need to be able to handle this for all drivers.
For sure, but as i said above in my opinion we need to change onlyfor those drivers that don't use the _locked version.
And that absolutely won't work.
See the dma_fence is a contract between drivers, so you need thesame calling convention between all drivers.
Either we always call the callback with the lock held or we alwayscall it without the lock, but sometimes like that and sometimesotherwise won't work.
Christian.
I am not sure I fully understand what problems this will cause butanyway, then we are back to irq_work. We cannot embed irq_work asunion within dma_fenc's cb_listbecause it's already reused as timestamp and as rcu head after thefence is signaled. So I will do it within drm_scheduler with singleirq_work per drm_sched_entity
as we discussed before.
That won't work either. We free up the entity after the cleanupfunction. That's the reason we use the callback on the job in thefirst place.
Yep, missed it.
We could overlead the cb structure in the job though.
I guess, since no one else is using this member it after the cb executed.

Andrey

Attached a patch. Give it a try please, I tested it on my side and triedto generate the right conditions to trigger this code path by repeatedlysubmitting commands while issuing GPU reset to stop the scheduler andthen killing command submissions process in the middle. But for somereason looks like the job_queue was always empty already at the time ofentity kill.


Andrey


Christian.


Andrey


Andrey

>From 8ba5c089939b79a6567411c33d4db40e5846eef3 Mon Sep 17 00:00:00 2001
From: Andrey Grodzovsky <andrey.grodzov...@amd.com>
Date: Thu, 28 Oct 2021 12:24:03 -0400
Subject: [PATCH] drm/sched: Avoid lockdep spalt on killing a processes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Probelm:
Singlaning one sched fence from within another's sched
fence singal callback generates lockdep splat because
the both have same lockdep class of their fence->lock

Fix:
Fix bellow stack by rescheduling to irq work of
signaling and killing of jobs that left when entity is killed.

[11176.741181]  dump_stack+0x10/0x12
[11176.741186] __lock_acquire.cold+0x208/0x2df
[11176.741197]  lock_acquire+0xc6/0x2d0
[11176.741204]  ? dma_fence_signal+0x28/0x80
[11176.741212] _raw_spin_lock_irqsave+0x4d/0x70
[11176.741219]  ? dma_fence_signal+0x28/0x80
[11176.741225]  dma_fence_signal+0x28/0x80
[11176.741230] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741240] drm_sched_entity_kill_jobs_cb+0x1c/0x50 [gpu_sched]
[11176.741248] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741254]  dma_fence_signal+0x3b/0x80
[11176.741260] drm_sched_fence_finished+0x12/0x20 [gpu_sched]
[11176.741268] drm_sched_job_done.isra.0+0x7f/0x1a0 [gpu_sched]
[11176.741277] drm_sched_job_done_cb+0x12/0x20 [gpu_sched]
[11176.741284] dma_fence_signal_timestamp_locked+0xac/0x1a0
[11176.741290]  dma_fence_signal+0x3b/0x80
[11176.741296] amdgpu_fence_process+0xd1/0x140 [amdgpu]
[11176.741504] sdma_v4_0_process_trap_irq+0x8c/0xb0 [amdgpu]
[11176.741731]  amdgpu_irq_dispatch+0xce/0x250 [amdgpu]
[11176.741954]  amdgpu_ih_process+0x81/0x100 [amdgpu]
[11176.742174]  amdgpu_irq_handler+0x26/0xa0 [amdgpu]
[11176.742393] __handle_irq_event_percpu+0x4f/0x2c0
[11176.742402] handle_irq_event_percpu+0x33/0x80
[11176.742408]  handle_irq_event+0x39/0x60
[11176.742414]  handle_edge_irq+0x93/0x1d0
[11176.742419]  __common_interrupt+0x50/0xe0
[11176.742426]  common_interrupt+0x80/0x90

Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
Suggested-by: Daniel Vetter  <daniel.vet...@ffwll.ch>
Suggested-by: Christian König <christian.koe...@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c | 15 ++++++++++++---
 include/drm/gpu_scheduler.h              | 12 +++++++++++-
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 27e1573af96e..191c56064f19 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -190,6 +190,16 @@ long drm_sched_entity_flush(struct drm_sched_entity *entity, long timeout)
 }
 EXPORT_SYMBOL(drm_sched_entity_flush);
 
+static void drm_sched_entity_kill_jobs_irq_work(struct irq_work *wrk)
+{
+	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
+
+	drm_sched_fence_finished(job->s_fence);
+	WARN_ON(job->s_fence->parent);
+	job->sched->ops->free_job(job);
+}
+
+
 /* Signal the scheduler finished fence when the entity in question is killed. */
 static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 					  struct dma_fence_cb *cb)
@@ -197,9 +207,8 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
 	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 						 finish_cb);
 
-	drm_sched_fence_finished(job->s_fence);
-	WARN_ON(job->s_fence->parent);
-	job->sched->ops->free_job(job);
+	init_irq_work(&job->work, drm_sched_entity_kill_jobs_irq_work);
+	irq_work_queue(&job->work);
 }
 
 static struct dma_fence *
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f011e4c407f2..bbc22fad8d80 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -28,6 +28,7 @@
 #include <linux/dma-fence.h>
 #include <linux/completion.h>
 #include <linux/xarray.h>
+#include <linux/irq_work.h>
 
 #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
 
@@ -286,7 +287,16 @@ struct drm_sched_job {
 	struct list_head		list;
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
-	struct dma_fence_cb		finish_cb;
+
+	/*
+	 * work is used only after finish_cb has been used and will not be
+	 * accessed anymore.
+	 */
+	union {
+		struct dma_fence_cb		finish_cb;
+		struct irq_work 		work;
+	};
+
 	uint64_t			id;
 	atomic_t			karma;
 	enum drm_sched_priority		s_priority;
-- 
2.25.1

Re: Lockdep spalt on killing a processes

Reply via email to