[AMD Official Use Only - Internal Distribution Only]

Hi, Christian,
     For this case, it is safe to use separate lock keys. Please see the 
definition of init_rwsem below. Every call to init_rwsem uses a new static key, 
but the devices of a hive share the same initialization code, so their 
lock_class_key ends up being the same. I think this is a limitation of 
init_rwsem. In our case, it is correct for the reset_sem of each adev to have a 
different lock_class_key. BTW, this change doesn't affect dead-lock detection; 
it just makes the annotation correct.

#define init_rwsem(sem)                                 \
do {                                                            \
        static struct lock_class_key __key;                     \
                                                                \
        __init_rwsem((sem), #sem, &__key);                      \
} while (0)

Best Regards
Dennis Li
-----Original Message-----
From: Koenig, Christian <christian.koe...@amd.com> 
Sent: Thursday, August 6, 2020 4:13 PM
To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander <alexander.deuc...@amd.com>; Kuehling, Felix 
<felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

Preventing locking problems during implementation is obviously a good approach, 
but lockdep has proven to be massively useful for finding and fixing problems.

Disabling lockdep splats by annotating locks with separate classes is usually a 
no-go and is only acceptable if there is no other possible approach.

In this case here we should really clean things up instead.

Christian.

Am 06.08.20 um 09:44 schrieb Li, Dennis:
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian,
>        I agree with your concern. However, we shouldn't rely on the system to 
> detect deadlocks; we should take this into account during code implementation. 
> In fact, dead-lock detection isn't enabled by default.
>        As for your proposal to move reset_sem into the hive structure, we can 
> open a new topic to discuss it. Currently we can't be sure which solution is 
> best. For example, with your proposal we must wait for all devices to resume 
> successfully before resubmitting an old task on one device, which will affect 
> performance.
>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of 
> Christian König
> Sent: Thursday, August 6, 2020 3:08 PM
> To: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix 
> <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
> locking
>
> Am 06.08.20 um 09:02 schrieb Dennis Li:
>> [  584.110304] ============================================
>> [  584.110590] WARNING: possible recursive locking detected
>> [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [  584.111164] --------------------------------------------
>> [  584.111456] kworker/38:1/553 is trying to acquire lock:
>> [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.112112]
>>                  but task is already holding lock:
>> [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.113068]
>>                  other info that might help us debug this:
>> [  584.113689]  Possible unsafe locking scenario:
>>
>> [  584.114350]        CPU0
>> [  584.114685]        ----
>> [  584.115014]   lock(&adev->reset_sem);
>> [  584.115349]   lock(&adev->reset_sem);
>> [  584.115678]
>>                   *** DEADLOCK ***
>>
>> [  584.116624]  May be due to missing lock nesting notation
>>
>> [  584.117284] 4 locks held by kworker/38:1/553:
>> [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
>> [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
>> [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
>> [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.119222]
>>                  stack backtrace:
>> [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
>> [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
>> [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
>> [  584.121638] Call Trace:
>> [  584.122050]  dump_stack+0x98/0xd5
>> [  584.122499]  __lock_acquire+0x1139/0x16e0
>> [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
>> [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
>> [  584.123771]  lock_acquire+0xb8/0x1c0
>> [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.124599]  down_write+0x49/0x120
>> [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>> [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
>> [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
>> [  584.126789]  process_one_work+0x29e/0x630
>> [  584.127208]  worker_thread+0x3c/0x3f0
>> [  584.127621]  ? __kthread_parkme+0x61/0x90
>> [  584.128014]  kthread+0x12f/0x150
>> [  584.128402]  ? process_one_work+0x630/0x630
>> [  584.128790]  ? kthread_park+0x90/0x90
>> [  584.129174]  ret_from_fork+0x3a/0x50
>>
>> Give each adev its own lock_class_key to avoid a false positive recursive
>> locking report.
> NAK, that is not a false positive but a real problem.
>
> The issue here is that we have multiple reset semaphores, one for each device 
> in the hive. If those are not acquired in the correct order we deadlock.
>
> The real solution would be to move the reset_sem into the hive structure and 
> make sure that we lock it only once.
>
> Christian.
>
>> Signed-off-by: Dennis Li <dennis...@amd.com>
>> Change-Id: I7571efeccbf15483982031d00504a353031a854a
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index e97c088d03b3..766dc8f8c8a0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -967,6 +967,7 @@ struct amdgpu_device {
>>      atomic_t                        in_gpu_reset;
>>      enum pp_mp1_state               mp1_state;
>>      struct rw_semaphore     reset_sem;
>> +    struct lock_class_key lock_key;
>>      struct amdgpu_doorbell_index doorbell_index;
>>    
>>      struct mutex                    notifier_lock;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 6c572db42d92..d78df9312d34 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3037,6 +3037,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>>      mutex_init(&adev->virt.vf_errors.lock);
>>      hash_init(adev->mn_hash);
>>      init_rwsem(&adev->reset_sem);
>> +    lockdep_set_class(&adev->reset_sem, &adev->lock_key);
>>      atomic_set(&adev->in_gpu_reset, 0);
>>      mutex_init(&adev->psp.mutex);
>>      mutex_init(&adev->notifier_lock);
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
