On Fri, Aug 7, 2020 at 5:32 PM Li, Dennis <dennis...@amd.com> wrote:
>
> On Fri, Aug 7, 2020 at 1:59 PM Li, Dennis <dennis...@amd.com> wrote:
> >
> > Hi, Daniel,
> >       Thanks for your review, and I also understand your concern. I guess you missed the description of this issue, so I paste it again below and explain why it happens.
> >
> > For example, take an XGMI system with 2 GPU devices, each represented by a device entity named adev. Each adev has a separate reset_sem.
> >
> > // device init function
> > void device_init(adev) {
> >     ...
> >     init_rwsem(&adev->reset_sem);
> >     ...
> > }
> >
> > The XGMI system has two GPU devices, so the system calls device_init twice. However, the definition of the init_rwsem macro has a limitation: it uses a local static lock_class_key to initialize the rw_sem, which causes every adev->reset_sem to share the same lock_class_key.
> >
> > #define init_rwsem(sem)                                 \
> > do {                                                    \
> >     static struct lock_class_key __key;                 \
> >                                                         \
> >     __init_rwsem((sem), #sem, &__key);                  \
> > } while (0)
> >
> > When a GPU hangs, we have to do GPU recovery for all devices in the hive. Therefore we must lock each adev->reset_sem to protect the GPUs from being accessed by other threads during recovery. The detailed recovery sequence is as follows:
> >
> > // Step 1: lock all GPUs first:
> > for each adev of GPU0 or GPU1 {
> >     down_write(adev->reset_sem);
> >     do_suspend(adev);
> > }
> >
> > // Step 2:
> > do_hive_recovery(hive);
> >
> > // Step 3: resume and unlock GPUs
> > for each adev of GPU0 or GPU1 {
> >     do_resume(adev);
> >     up_write(adev->reset_sem);
> > }
> >
> > Every adev->reset_sem has the same lock_class_key, so lockdep treats them as the same rw_semaphore object. Therefore in step 1, when locking GPU1, lockdep pops the warning below.
> >
> > I considered your proposal (using _nest_lock()) before. Just as you said, _nest_lock() will hide true positive recursive locking, so I gave it up in the end.
> >
> > After reviewing the lockdep code, I found that lockdep supports dynamic keys, so using separate lock_class_keys has no side effect. In fact, init_rwsem also uses a dynamic key. See the call sequences of init_rwsem and lockdep_set_class below:
> > 1) init_rwsem -> __init_rwsem -> lockdep_init_map;
> > 2) lockdep_set_class -> lockdep_init_map;
> >
> > Finally, back to your concern: you may worry that with this change the deadlock below can no longer be detected. In fact, lockdep is still able to detect it as a circular locking dependency; there is just no "recursive locking" warning in this case.
> > Thread A: down_write(adev->reset_sem) for GPU 0 -> down_write(adev->reset_sem) for GPU 1 -> ... -> up_write(adev->reset_sem) for GPU 1 -> up_write(adev->reset_sem) for GPU 0
> > Thread B: down_write(adev->reset_sem) for GPU 1 -> down_write(adev->reset_sem) for GPU 0 -> ... -> up_write(adev->reset_sem) for GPU 0 -> up_write(adev->reset_sem) for GPU 1
> >
> > And lockdep can still detect recursive locking for this case:
> > Thread A: down_write(adev->reset_sem) for GPU 0 -> ... -> ... -> down_write(adev->reset_sem) for GPU 0
>
> Yeah, I guessed correctly what you're doing. My recommendation to use down_write_nest_lock still stands. This is for reset only; a kernel-wide reset lock really shouldn't hurt. Or make it a lock per xgmi hive, I'm assuming that information is known somewhere.
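[For illustration, the _nest_lock() scheme suggested above could look roughly like the sketch below. xgmi_hive_lock and the two helper functions are made-up names for this sketch, not existing amdgpu code; down_write_nest_lock() is the real lockdep-aware rwsem API, and the hive/device-list layout is assumed from the discussion.]

    #include <linux/list.h>
    #include <linux/mutex.h>
    #include <linux/rwsem.h>

    /* Made-up super-lock; could also live per-hive in amdgpu_hive_info.
     * struct amdgpu_device / struct amdgpu_hive_info are assumed to come
     * from the amdgpu headers. */
    static DEFINE_MUTEX(xgmi_hive_lock);

    /* Lock every device in the hive for reset, nesting under the super-lock. */
    static void hive_reset_lock_all(struct amdgpu_hive_info *hive)
    {
            struct amdgpu_device *adev;

            mutex_lock(&xgmi_hive_lock);
            list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
                    /* no recursive-locking splat: lockdep sees the held super-lock */
                    down_write_nest_lock(&adev->reset_sem, &xgmi_hive_lock);
    }

    static void hive_reset_unlock_all(struct amdgpu_hive_info *hive)
    {
            struct amdgpu_device *adev;

            list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
                    up_write(&adev->reset_sem);
            mutex_unlock(&xgmi_hive_lock);
    }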
>
> The problem with manual lockdep annotations is that they increase complexity. You have to keep all the annotations in mind, including justifications and which parts they still catch and which parts they don't catch. And there's zero performance justification for a gpu reset path to create some fancy lockdep special cases.
>
> Locking needs to be
> 1. maintainable, i.e. every time you need to write a multi-paragraph explanation like the above it's probably not. This obviously includes correctness, but it's even more important that people can easily understand what you're doing.
> 2. fast enough, where it matters. gpu reset just doesn't.
>
> [Dennis Li] Yeah. I strongly agree that manual lockdep annotation will increase complexity. However, my patch isn't a lockdep annotation; in fact it fixes a bug. According to the design of lockdep, every lock object should have a separate lock_class_key, which is a unique identifier of the lock object in lockdep.
Nope, that's not how lockdep works. It intentionally shares the lockdep class. If every lock had its own key, all lockdep would achieve is producing a bit more heat and slowing down your CPU. Yes, it uses the macro trick to make sure you share the lockdep key the right way in 99% of all cases, but there are also many other examples with manually shared lockdep keys. Shared lockdep keys are the entire point of lockdep. It doesn't do much useful without that.
-Daniel

[Dennis Li] Yes. I agree that using the shared lockdep class has higher CPU performance when lockdep is enabled, and 99% of all cases work well, but our case belongs to the 1%. You agree as well that down_write_nest_lock has side effects and will increase complexity. Comparing the two methods, I still think using a separate class_key is the best choice, because it has no side effect and the performance drop is very small.

> Take the previous sample code again as an example: we define two rw_sems, a and b, and use case 1 and case 2 to init them. In my opinion, case 1 and case 2 should have the same behavior, but in fact they differ, because case 2 uses the init_rwsem macro wrongly. Therefore we should use lockdep_set_class to fix this issue, as in case 3.
>
> Case 1:
> init_rwsem(&a);
> init_rwsem(&b);
>
> Case 2:
> void init_rwsem_ext(struct rw_semaphore *sem)
> {
>     init_rwsem(sem);
> }
> init_rwsem_ext(&a);
> init_rwsem_ext(&b);
>
> Case 3:
> void init_rwsem_ext(struct rw_semaphore *sem, struct lock_class_key *key)
> {
>     init_rwsem(sem);
>     lockdep_set_class(sem, key);
> }
> init_rwsem_ext(&a, &a_key);
> init_rwsem_ext(&b, &b_key);
>
> > [ 584.110304] ============================================
> > [ 584.110590] WARNING: possible recursive locking detected
> > [ 584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G OE
> > [ 584.111164] --------------------------------------------
> > [ 584.111456] kworker/38:1/553 is trying to acquire lock:
> > [ 584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> > [ 584.112112] but task is already holding lock:
> > [ 584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> > [ 584.113068] other info that might help us debug this:
> > [ 584.113689]  Possible unsafe locking scenario:
> >
> > [ 584.114350]        CPU0
> > [ 584.114685]        ----
> > [ 584.115014]   lock(&adev->reset_sem);
> > [ 584.115349]   lock(&adev->reset_sem);
> > [ 584.115678]
> >                 *** DEADLOCK ***
> >
> > [ 584.116624]  May be due to missing lock nesting notation
> >
> > [ 584.117284] 4 locks held by kworker/38:1/553:
> > [ 584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
> > [ 584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
> > [ 584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> > [ 584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>
> Uh, so this means this rwsem is in the critical path for dma-fence signalling. Please make sure this is all correctly primed at driver load (like we do already for dma_resv/fence and drm_modeset_lock) when lockdep is enabled. Otherwise it's pretty much guaranteed you'll get this wrong somewhere.
>
> -Daniel
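[For illustration, the "priming at driver load" requested here could look roughly like the sketch below, modeled on the lockdep priming that drivers/dma-buf/dma-resv.c does at init; the function name is hypothetical, the dma_resv calls are the real API.]

    #include <linux/dma-resv.h>
    #include <linux/rwsem.h>

    #if IS_ENABLED(CONFIG_LOCKDEP)
    /* Hypothetical: run once at driver load so lockdep records the
     * reset_sem -> dma_resv ordering before the first real GPU reset. */
    static void amdgpu_reset_sem_prime(struct amdgpu_device *adev)
    {
            struct dma_resv obj;

            dma_resv_init(&obj);

            down_write(&adev->reset_sem);
            /* teach lockdep: dma_resv is taken inside the reset path */
            dma_resv_lock(&obj, NULL);
            dma_resv_unlock(&obj);
            up_write(&adev->reset_sem);

            dma_resv_fini(&obj);
    }
    #endif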
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Daniel Vetter <dan...@ffwll.ch>
> > Sent: Friday, August 7, 2020 5:45 PM
> > To: Koenig, Christian <christian.koe...@amd.com>
> > Cc: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking
> >
> > On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koe...@amd.com> wrote:
> > >
> > > Am 07.08.20 um 11:23 schrieb Daniel Vetter:
> > > > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <dan...@ffwll.ch> wrote:
> > > >> On Fri, Aug 7, 2020 at 11:08 AM Christian König <christian.koe...@amd.com> wrote:
> > > >>> [SNIP]
> > > >>>>>>> What we should do instead is to make sure we have only a single lock for the complete hive instead.
> > > >>>>>>> [Dennis Li] If we use a single lock, users must wait for all devices to resume successfully, but in fact their tasks are only running on device A. It is not friendly to users.
> > > >>>>>> Well that is intentional I would say. We can only restart submissions after all devices are resumed successfully, cause otherwise it can happen that a job on device A depends on memory on device B which isn't accessible.
> > > >>>>>> [Dennis Li] Yes, you are right. The driver must make sure that the shared resources (such as the shared memory) are ready before unlocking the lock of each adev one by one. But each device still has private hardware resources such as video and display.
> > > >>>>> Yeah, but those are negligible and we have a team working on display support for XGMI. So this will sooner or later become a problem as well.
> > > >>>>>
> > > >>>>> I still think that a single rwsem for the whole hive is the best option here.
> > > >>>>>
> > > >>>>> [Dennis Li] For your above concern, we can open a new thread to discuss it. If we decide to use a single lock after discussing, we can create another patch for it.
> > > >>>> That's a really good argument, but I still hesitate to merge this patch.
> > > >>>> How severe is the lockdep splat?
> > > >>>>
> > > >>>> At bare minimum we need a "/* TODO: ... */" comment explaining why we do this and how to remove the workaround again.
> > > >>>> [Dennis Li] It is not a workaround. According to the design of lockdep, each rw_semaphore should have a separate lock_class_key. I have explained that init_rwsem has the above limitation, so we must correct it.
> > > >>> Yeah, but this explanation is not really sufficient. Again, this is not a limitation of init_rwsem, but intentional behavior.
> > > >>>
> > > >>> In other words, in most cases lockdep should splat if you use rw semaphores like this.
> > > >>>
> > > >>> That we avoid this by always locking multiple amdgpu devices in the same order is rather questionable driver design.
> > > >> Yeah, don't give out multiple locking classes like that, that's fairly terrible.
> > >
> > > Ok, that is exactly the answer I feared we would get.
> > >
> > > >> The right fix is mutex_lock_nest_lock, which we're using in ww_mutex and which is also used in mm_take_all_locks.
> > >
> > > Ah! Enlightenment!
> > > So in this case we should use down_write_nested() in this special space and all is well?
> >
> > Nope. There's two kinds of nesting that lockdep supports:
> >
> > 1. _nested() variants, where you supply an integer parameter specifying the lockdep subclass. That's like setting a locking subclass at lock creation time, except much worse because you change the lockdep class at runtime. This can lead to stuff like:
> >
> > struct mutex A;
> > mutex_init(&A);
> > mutex_lock(&A);
> > mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);
> >
> > which is a very clear deadlock, but lockdep will think it's all fine. Don't use that.
> >
> > 2. _nest_lock() variants, where you specify a super-lock, which makes sure that all sub-locks cannot be acquired concurrently (or you promise there's some other trick to avoid deadlocks). Lockdep checks that you've acquired the super-lock, and then allows arbitrary nesting. This is what ww_mutex uses, and what mm_take_all_locks uses. If the super-lock is an actual mutex, then this is actually deadlock free. With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we guarantee deadlock-freeness through the ww algorithm. The fake super-lock for ww_mutex is acquired when you get the ticket.
> >
> > No 2 is the thing you want, not no 1.
> >
> > > >> If you solve the locking ordering by sorting all the locks you need (ugh), then you can also just use a fake lockdep_map for your special function only.
> > > >>
> > > >> Also, in general the idea of an rwsem in conjunction with xgmi cross device memory handling just plain freaks me out :-) Please make sure you prime this lock against dma_resv and dma_fence (and maybe also drm_modeset_lock if this lock is outside of dma_resv), and ideally this should only ever be used for setup stuff when loading drivers. I guess that's why you went with a per-device rwsem. If it's really only for setup then just do the simple thing of having an xgmi_hive_reconfigure lock, which all the rwsems nest within, and no fancy lock ordering or other tricks.
> > >
> > > Well this is for the group reset of all devices in a hive and you need to reboot to change the order the devices are in the list.
> >
> > Ok, that sounds safe enough. I'd go with a xgmi_hive_lock then, and down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock); Taking the global lock for device reset shouldn't be a problem, we're not going to reset multiple devices in parallel anyway. Or at least I hope that use-case isn't a perf critical benchmark for you guys :-)
> >
> > Cheers, Daniel
> >
> > > Regards,
> > > Christian.
> > >
> > > >>
> > > >>>> Core network driver (net/core/dev.c) has a similar use case to ours.
> > > >>> Just took a look at that, but this is a completely different use case. The lockdep classes are grouped by driver type to note the difference in the network stack.
> > > >>>
> > > >>> A quick search showed that no other device driver is doing this in the kernel, so I'm not sure if such a behavior wouldn't be rejected. Adding Daniel to comment on this as well.
> > > >>>
> > > >>> Felix had some really good arguments to make that an urgent fix, so either we come up with some real rework of this or at least add a big "/* TODO... */".
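[For illustration, the _nest_lock() pattern from point 2 could look roughly like this sketch (all lock names made up), in the spirit of what mm_take_all_locks does with its super-lock.]

    #include <linux/mutex.h>

    static DEFINE_MUTEX(super_lock);      /* the super-lock from point 2 */
    static struct mutex sub_lock[2];      /* same lockdep class, like the per-GPU
                                           * reset_sem; mutex_init() them at setup */

    void take_all_locks(void)
    {
            int i;

            /* every path taking multiple sub-locks must hold this first */
            mutex_lock(&super_lock);
            for (i = 0; i < 2; i++)
                    /* no recursive-locking splat: lockdep verifies the
                     * super-lock is held and then allows the nesting */
                    mutex_lock_nest_lock(&sub_lock[i], &super_lock);
    }

    void drop_all_locks(void)
    {
            int i;

            for (i = 0; i < 2; i++)
                    mutex_unlock(&sub_lock[i]);
            mutex_unlock(&super_lock);
    }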
> > > >> Aside from "I have questions" I don't think there's any reason to not fix this correctly with mutex_lock_nest_lock. Really shouldn't be a semantic change from what I understand here.
> > > > Also one more with my drm maintainer hat on, as a heads-up: Dave & me looked at i915 gem code a bit too much, and we're appalled at the massive over-use of lockdep class hacks and dubious nesting annotations. Expect a crackdown on anyone who tries to play clever tricks here, we have a few too many already :-)
> > > >
> > > > So no mutex_lock_nested (totally different beast from mutex_lock_nest_lock, and really unsafe since it's a runtime change of the lockdep class) and also not any lockdep_set_class trickery or anything like that. We need locking hierarchies which mere humans can comprehend and review.
> > > >
> > > > Cheers, Daniel
> > > >
> > > >> Cheers, Daniel
> > > >>
> > > >>> Regards,
> > > >>> Christian.
> > > >>>
> > > >>>> Regards,
> > > >>>> Christian.
> > > >>>>
> > > >>
> > > >> --
> > > >> Daniel Vetter
> > > >> Software Engineer, Intel Corporation
> > > >> http://blog.ffwll.ch
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx