[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,

For this case, it is safe to use separate lock keys. Please see the definition of init_rwsem below. Every init_rwsem call site uses a single static key, but all devices of the hive share the same initialization code, so their lock_class_key is the same. I think this is a limitation of init_rwsem. In our case, it is correct for the reset_sem of each adev to have a different lock_class_key. BTW, this change doesn't disable deadlock detection; it only corrects the annotation.
#define init_rwsem(sem)					\
do {							\
	static struct lock_class_key __key;		\
							\
	__init_rwsem((sem), #sem, &__key);		\
} while (0)

Best Regards
Dennis Li

-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, but lockdep has proven to be massively useful for finding and fixing problems.

Silencing a lockdep splat by annotating the locks with separate classes is usually a no-go and only allowed if there is no other potential approach. In this case here we should really clean things up instead.

Christian.

On 06.08.20 09:44, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
> I agree with your concern. However, we shouldn't rely on the system to
> detect deadlocks; we should take this into account when implementing the
> code. In fact, deadlock detection isn't enabled by default.
> As for your proposal to move reset_sem into the hive structure, we can
> open a new topic to discuss it. Currently we can't be sure which is the
> best solution. For example, with your proposal, we must wait for all
> devices to resume successfully before resubmitting an old task on one
> device, which will affect performance.
>
> Best Regards
> Dennis Li
>
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Christian König
> Sent: Thursday, August 6, 2020 3:08 PM
> To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
>
> On 06.08.20 09:02, Dennis Li wrote:
>> [ 584.110304] ============================================
>> [ 584.110590] WARNING: possible recursive locking detected
>> [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [ 584.111164] --------------------------------------------
>> [ 584.111456] kworker/38:1/553 is trying to acquire lock:
>> [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.112112]
>> but task is already holding lock:
>> [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.113068]
>> other info that might help us debug this:
>> [ 584.113689]  Possible unsafe locking scenario:
>>
>> [ 584.114350]        CPU0
>> [ 584.114685]        ----
>> [ 584.115014]   lock(&adev->reset_sem);
>> [ 584.115349]   lock(&adev->reset_sem);
>> [ 584.115678]
>> *** DEADLOCK ***
>>
>> [ 584.116624]  May be due to missing lock nesting notation
>>
>> [ 584.117284] 4 locks held by kworker/38:1/553:
>> [ 584.117616] #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
>> [ 584.117967] #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
>> [ 584.118358] #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
>> [ 584.118786] #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.119222]
>> stack backtrace:
>> [ 584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>> [ 584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
>> [ 584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>> [ 584.121638] Call Trace:
>> [ 584.122050]  dump_stack+0x98/0xd5
>> [ 584.122499]  __lock_acquire+0x1139/0x16e0
>> [ 584.122931]  ? trace_hardirqs_on+0x3b/0xf0
>> [ 584.123358]  ? cancel_delayed_work+0xa6/0xc0
>> [ 584.123771]  lock_acquire+0xb8/0x1c0
>> [ 584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.124599]  down_write+0x49/0x120
>> [ 584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [ 584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
>> [ 584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
>> [ 584.126789]  process_one_work+0x29e/0x630
>> [ 584.127208]  worker_thread+0x3c/0x3f0
>> [ 584.127621]  ? __kthread_parkme+0x61/0x90
>> [ 584.128014]  kthread+0x12f/0x150
>> [ 584.128402]  ? process_one_work+0x630/0x630
>> [ 584.128790]  ? kthread_park+0x90/0x90
>> [ 584.129174]  ret_from_fork+0x3a/0x50
>>
>> Each adev owns its own lock_class_key to avoid false positive recursive locking.
>
> NAK, that is not a false positive but a real problem.
>
> The issue here is that we have multiple reset semaphores, one for each device in the hive. If those are not acquired in the correct order we deadlock.
>
> The real solution would be to move the reset_sem into the hive structure and make sure that we lock it only once.
>
> Christian.
>
>> Signed-off-by: Dennis Li <dennis...@amd.com>
>> Change-Id: I7571efeccbf15483982031d00504a353031a854a
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index e97c088d03b3..766dc8f8c8a0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -967,6 +967,7 @@ struct amdgpu_device {
>>  	atomic_t			in_gpu_reset;
>>  	enum pp_mp1_state		mp1_state;
>>  	struct rw_semaphore		reset_sem;
>> +	struct lock_class_key		lock_key;
>>  	struct amdgpu_doorbell_index	doorbell_index;
>>
>>  	struct mutex			notifier_lock;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 6c572db42d92..d78df9312d34 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>  	mutex_init(&adev->virt.vf_errors.lock);
>>  	hash_init(adev->mn_hash);
>>  	init_rwsem(&adev->reset_sem);
>> +	lockdep_set_class(&adev->reset_sem, &adev->lock_key);
>>  	atomic_set(&adev->in_gpu_reset, 0);
>>  	mutex_init(&adev->psp.mutex);
>>  	mutex_init(&adev->notifier_lock);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx