[AMD Public Use]

Hi Daniel,

Thanks for your review, and I understand your concern. I guess you missed the description of this issue, so I'm pasting it again below and explaining why it happens.
For example, consider an XGMI system with two GPU devices, each represented by a device entity named adev. Each adev has its own reset_sem:

  // device init function
  void device_init(adev) {
      ...
      init_rwsem(&adev->reset_sem);
      ...
  }

Because the XGMI system has two GPU devices, device_init is called twice. However, the init_rwsem macro has a limitation: it uses a local static lock_class_key to initialize the rw_semaphore, so every adev->reset_sem shares the same lock_class_key:

  #define init_rwsem(sem)                                \
  do {                                                   \
          static struct lock_class_key __key;            \
                                                         \
          __init_rwsem((sem), #sem, &__key);             \
  } while (0)

When a GPU hangs, we must do GPU recovery for all devices in the hive, so we lock each adev->reset_sem to protect the GPUs from being accessed by other threads during recovery. The detailed recovery sequence is the following:

  // Step 1: lock all GPUs first
  for each adev in {GPU0, GPU1} {
      down_write(&adev->reset_sem);
      do_suspend(adev);
  }

  // Step 2:
  do_hive_recovery(hive);

  // Step 3: resume and unlock GPUs
  for each adev in {GPU0, GPU1} {
      do_resume(adev);
      up_write(&adev->reset_sem);
  }

Since every adev->reset_sem has the same lock_class_key, lockdep treats them all as the same rw_semaphore object. Therefore in step 1, when we lock GPU1, lockdep pops the warning below.

I had considered your proposal (using _nest_lock()) before. Just as you said, _nest_lock() will hide true positives for recursive locking, so I gave it up in the end. After reviewing the lockdep code, I found that lockdep supports dynamic keys, so using separate lock_class_keys has no side effect. In fact, init_rwsem also uses a dynamic key. See the call sequences of init_rwsem and lockdep_set_class below:

  1) init_rwsem -> __init_rwsem -> lockdep_init_map;
  2) lockdep_set_class -> lockdep_init_map;
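To make the fix concrete, the following is a minimal sketch of the separate-key idea (MAX_HIVE_DEVICES, dev_index and the key array are illustrative assumptions for this sketch, not the exact patch):

  #include <linux/lockdep.h>
  #include <linux/rwsem.h>

  #define MAX_HIVE_DEVICES 8      /* assumption for this sketch */

  /* One static key per device slot, so each reset_sem gets its own
   * lockdep class instead of the single per-call-site static key
   * hidden inside the init_rwsem() macro expansion. */
  static struct lock_class_key reset_sem_keys[MAX_HIVE_DEVICES];

  /* assumes struct amdgpu_device has a reset_sem rw_semaphore member */
  static void device_init(struct amdgpu_device *adev, int dev_index)
  {
          init_rwsem(&adev->reset_sem);
          /* Override the shared class from init_rwsem() with a
           * per-device one. */
          lockdep_set_class(&adev->reset_sem, &reset_sem_keys[dev_index]);
  }

With distinct classes, locking GPU0 and then GPU1 in step 1 no longer looks recursive to lockdep, while taking the same device's reset_sem twice still triggers the recursive-locking warning shown below.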
Finally, back to your concern: you may worry that this change makes the deadlock below undetectable. In fact, lockdep is still able to detect the issue, as a circular locking dependency; there is just no "recursive locking" warning in this case.

  Thread A: down_write(&adev->reset_sem) for GPU 0
            -> down_write(&adev->reset_sem) for GPU 1
            -> ...
            -> up_write(&adev->reset_sem) for GPU 1
            -> up_write(&adev->reset_sem) for GPU 0

  Thread B: down_write(&adev->reset_sem) for GPU 1
            -> down_write(&adev->reset_sem) for GPU 0
            -> ...
            -> up_write(&adev->reset_sem) for GPU 0
            -> up_write(&adev->reset_sem) for GPU 1

And lockdep can still detect recursive locking for this case:

  Thread A: down_write(&adev->reset_sem) for GPU 0
            -> ... -> ...
            -> down_write(&adev->reset_sem) for GPU 0

[ 584.110304] ============================================
[ 584.110590] WARNING: possible recursive locking detected
[ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
[ 584.111164] --------------------------------------------
[ 584.111456] kworker/38:1/553 is trying to acquire lock:
[ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[ 584.112112] but task is already holding lock:
[ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[ 584.113068] other info that might help us debug this:
[ 584.113689]  Possible unsafe locking scenario:
[ 584.114350]        CPU0
[ 584.114685]        ----
[ 584.115014]   lock(&adev->reset_sem);
[ 584.115349]   lock(&adev->reset_sem);
[ 584.115678]  *** DEADLOCK ***
[ 584.116624]  May be due to missing lock nesting notation
[ 584.117284] 4 locks held by kworker/38:1/553:
[ 584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
[ 584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[ 584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[ 584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]

Best Regards
Dennis Li

-----Original Message-----
From: Daniel Vetter <dan...@ffwll.ch>
Sent: Friday, August 7, 2020 5:45 PM
To: Koenig, Christian <christian.koe...@amd.com>
Cc: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koe...@amd.com> wrote:
>
> On 07.08.20 at 11:23, Daniel Vetter wrote:
> > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <dan...@ffwll.ch> wrote:
> >> On Fri, Aug 7, 2020 at 11:08 AM Christian König
> >> <christian.koe...@amd.com> wrote:
> >>> [SNIP]
> >>>>>>> What we should do instead is to make sure we have only a single lock
> >>>>>>> for the complete hive instead.
> >>>>>>> [Dennis Li] If we use a single lock, users must wait for all
> >>>>>>> devices to resume successfully, but in fact their tasks are only
> >>>>>>> running on device A. It is not friendly to users.
> >>>>>> Well, that is intentional, I would say. We can only restart submissions
> >>>>>> after all devices are resumed successfully, because otherwise it can
> >>>>>> happen that a job on device A depends on memory on device B which
> >>>>>> isn't accessible.
> >>>>>> [Dennis Li] Yes, you are right. The driver has to make sure that the
> >>>>>> shared resources (such as the shared memory) are ready before unlocking
> >>>>>> the lock of each adev one by one. But each device still has private
> >>>>>> hardware resources such as video and display.
> >>>>> Yeah, but those are negligible, and we have a team working on display
> >>>>> support for XGMI. So this will sooner or later become a problem as well.
> >>>>>
> >>>>> I still think that a single rwsem for the whole hive is the best
> >>>>> option here.
> >>>>>
> >>>>> [Dennis Li] For your above concern, we can open a new thread to
> >>>>> discuss it. If we decide to use a single lock after discussing, we
> >>>>> can create another patch for it.
> >>>> That's a really good argument, but I still hesitate to merge this patch.
> >>>> How severe is the lockdep splat?
> >>>>
> >>>> At bare minimum we need a "/* TODO: ...." comment explaining why we do
> >>>> this and how to remove the workaround again.
> >>>> [Dennis Li] It is not a workaround. According to the design of lockdep,
> >>>> each rw_semaphore should have a separate lock_class_key. I have explained
> >>>> that init_rwsem has the above limitation, so we must correct it.
> >>> Yeah, but this explanation is not really sufficient. Again, this is
> >>> not a limitation of init_rwsem, but intentional behavior.
> >>>
> >>> In other words, in most cases lockdep should splat if you use rw
> >>> semaphores like this.
> >>>
> >>> That we avoid this by always locking multiple amdgpu devices in the
> >>> same order is rather questionable driver design.
> >> Yeah, don't give out multiple locking classes like that, that's fairly
> >> terrible.
>
> Ok, that is exactly the answer I feared we would get.
>
> >> The right fix is mutex_lock_nest_lock, which we're using in
> >> ww_mutex and which is also used in mm_take_all_locks.
>
> Ah! Enlightenment! So in this case we should use down_write_nested()
> in this special space and all is well?

Nope. There are two kinds of nesting that lockdep supports:

1. _nested() variants, where you supply an integer parameter specifying the lockdep subclass. That's like setting a locking subclass at lock creation time, except much worse, because you change the lockdep class at runtime. This can lead to stuff like:

   struct mutex A;

   mutex_init(&A);
   mutex_lock(&A);
   mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);

which is a very clear deadlock, but lockdep will think it's all fine. Don't use that.

2. _nest_lock() variants, where you specify a super-lock, which makes sure that all sub-locks cannot be acquired concurrently (or you promise there's some other trick to avoid deadlocks). Lockdep checks that you've acquired the super-lock, and then allows arbitrary nesting. This is what ww_mutex uses, and what mm_take_all_locks uses. If the super-lock is an actual mutex, then this is actually deadlock-free. With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we guarantee deadlock-freeness through the ww algorithm. The fake super-lock for ww_mutex is acquired when you get the ticket.

No 2 is the thing you want, not no 1.
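Sketch of what no 2 could look like here (untested, all names made up; roughly the xgmi_hive_lock idea further down):

  #include <linux/mutex.h>
  #include <linux/rwsem.h>

  /* Hypothetical hive-wide super-lock. Because it is a real mutex, two
   * threads can never interleave the per-device down_write calls below,
   * which is exactly the guarantee lockdep wants for _nest_lock(). */
  static DEFINE_MUTEX(xgmi_hive_lock);

  static void hive_lock_all(struct amdgpu_device *devs[], int n)
  {
          int i;

          mutex_lock(&xgmi_hive_lock);
          for (i = 0; i < n; i++)
                  /* annotated as nesting under the held super-lock */
                  down_write_nest_lock(&devs[i]->reset_sem, &xgmi_hive_lock);
  }

  static void hive_unlock_all(struct amdgpu_device *devs[], int n)
  {
          while (n--)
                  up_write(&devs[n]->reset_sem);
          mutex_unlock(&xgmi_hive_lock);
  }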
> >> If you solve the locking ordering by sorting all the locks you need
> >> (ugh), then you can also just use a fake lockdep_map for your
> >> special function only.
> >>
> >> Also, in general the idea of an rwsem in conjunction with xgmi cross
> >> device memory handling just plain freaks me out :-) Please make
> >> sure you prime this lock against dma_resv and dma_fence (and maybe
> >> also drm_modeset_lock if this lock is outside of dma_resv), and
> >> ideally this should only ever be used for setup stuff when loading
> >> drivers. I guess that's why you went with a per-device rwsem. If
> >> it's really only for setup then just do the simple thing of having
> >> an xgmi_hive_reconfigure lock, which all the rwsems nest within,
> >> and no fancy lock ordering or other tricks.
>
> Well, this is for the group reset of all devices in a hive, and you need
> to reboot to change the order the devices are in the list.

Ok, that sounds safe enough.

I'd go with a xgmi_hive_lock then, and

  down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock);

Taking the global lock for device reset shouldn't be a problem, we're not going to reset multiple devices in parallel anyway. Or at least I hope that use-case isn't a perf-critical benchmark for you guys :-)

Cheers, Daniel

> Regards,
> Christian.
>
> >>
> >>>> The core network driver (net/core/dev.c) has a similar use case to ours.
> >>> Just took a look at that, but it is a completely different use
> >>> case. The lockdep classes there are grouped by driver type to note the
> >>> difference in the network stack.
> >>>
> >>> A quick search showed that no other device driver is doing this in
> >>> the kernel, so I'm not sure such a behavior wouldn't be
> >>> rejected. Adding Daniel to comment on this as well.
> >>>
> >>> Felix had some really good arguments for making that an urgent fix,
> >>> so either we come up with some real rework of this or at least add
> >>> a big "/* TODO....*/".
> >> Aside from "I have questions" I don't think there's any reason to
> >> not fix this correctly with mutex_lock_nest_lock. It really shouldn't
> >> be a semantic change from what I understand here.
>
> Also one more with my drm maintainer hat on, as a heads-up: Dave & me
> looked at i915 gem code a bit too much, and we're appalled at the
> massive over-use of lockdep class hacks and dubious nesting
> annotations. Expect a crackdown on anyone who tries to play clever
> tricks here, we have a few too many already :-)
>
> So no mutex_lock_nested (a totally different beast from
> mutex_lock_nest_lock, and really unsafe since it's a runtime change
> of the lockdep class) and also no lockdep_set_class trickery or
> anything like that. We need locking hierarchies which mere humans
> can comprehend and review.
>
> Cheers, Daniel
>
> >> Cheers, Daniel
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch/

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx