On Fri, Aug 7, 2020 at 5:32 PM Li, Dennis <dennis...@amd.com> wrote:
>
> On Fri, Aug 7, 2020 at 1:59 PM Li, Dennis <dennis...@amd.com> wrote:
> >
> > Hi, Daniel,
> >       Thanks for your review. I also understand your concern. I guess you 
> > missed the description of this issue, so I am pasting it again below and 
> > explaining why this issue happens.
> >
> >       Take, for example, an XGMI system with 2 GPU devices, each represented 
> > by a device entity named adev. Each adev has its own reset_sem.
> >       // device init function
> >       void device_init(adev) {
> >         ...
> >         init_rwsem(&adev->reset_sem);
> >               ...
> >       }
> >
> >      The XGMI system has two GPU devices, so the system calls device_init 
> > twice. However, the init_rwsem macro has a limitation: it uses a local 
> > static lock_class_key to initialize the rw_semaphore, which causes both 
> > adev->reset_sem instances to share the same lock_class_key.
> >
> >      #define init_rwsem(sem)                                     \
> >      do {                                                        \
> >          static struct lock_class_key __key;                     \
> >                                                                  \
> >          __init_rwsem((sem), #sem, &__key);                      \
> >      } while (0)
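> >
> >      Because device_init() contains exactly one expansion of init_rwsem(),
> >      both calls below end up reusing that single static __key (a sketch of
> >      the effect; the device names are illustrative):
> >
> >      device_init(adev0);   /* adev0->reset_sem gets lockdep class __key */
> >      device_init(adev1);   /* adev1->reset_sem gets the very same __key */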
> >
> >      And when a GPU hangs, we do GPU recovery for all devices in the 
> > hive. Therefore we must lock each adev->reset_sem to protect the GPUs from 
> > being accessed by other threads during recovery. The detailed recovery 
> > sequence is as follows:
> >      // Step1: lock all GPUs first:
> >      for each adev of GPU0 or GPU1 {
> >            down_write(adev->reset_sem);
> >            do_suspend(adev);
> >     }
> >
> >     // Step2:
> >     do_hive_recovery(hive);
> >
> >     // Step3: resume and unlock GPU
> >     for each adev of GPU0 or GPU1 {
> >           do_resume(adev)
> >           up_write(adev->reset_sem);
> >     }
> >
> >      Each adev->reset_sem has the same lock_class_key, so lockdep treats 
> > them as the same lock class. Therefore in Step1, when locking GPU1, 
> > lockdep prints the warning below.
> >
> >     I had considered your proposal (using _nest_lock()) before. Just as 
> > you said, _nest_lock() will hide true-positive recursive locking, so I gave 
> > it up in the end.
> >
> >     After reviewing the lockdep code, I found that lockdep supports 
> > dynamic keys, so using a separate lock_class_key has no side effect. In 
> > fact, init_rwsem also uses a dynamic key. Please see the call sequences of 
> > init_rwsem and lockdep_set_class below:
> >    1) init_rwsem -> __init_rwsem -> lockdep_init_map;
> >    2) lockdep_set_class -> lockdep_init_map;
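> >
> >    A minimal sketch of this approach (the per-device key field name is 
> >    illustrative, not necessarily what the patch uses):
> >
> >    struct amdgpu_device {
> >        ...
> >        struct rw_semaphore reset_sem;
> >        struct lock_class_key reset_sem_key;  /* one key per device */
> >    };
> >
> >    void device_init(struct amdgpu_device *adev)
> >    {
> >        ...
> >        init_rwsem(&adev->reset_sem);
> >        /* replace the shared static key with this device's own key */
> >        lockdep_set_class(&adev->reset_sem, &adev->reset_sem_key);
> >        ...
> >    }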
> >
> >     Finally, back to your concern: you may worry that with this change the 
> > deadlock below can no longer be detected. In fact, lockdep is still able 
> > to detect the issue as a circular locking dependency; there is just no 
> > "recursive locking" warning in this case.
> >     Thread A: down_write(adev->reset_sem) for GPU 0 -> 
> > down_write(adev->reset_sem) for GPU 1 -> ... -> up_write(adev->reset_sem) 
> > for GPU 1 -> up_write(adev->reset_sem) for GPU 0
> >     Thread B: down_write(adev->reset_sem) for GPU 1 ->
> > down_write(adev->reset_sem) for GPU 0 -> ... ->
> > up_write(adev->reset_sem) for GPU 0 -> up_write(adev->reset_sem) for 
> > GPU 1
> >
> >     But lockdep can still detect recursive locking for this case:
> >     Thread A: down_write(adev->reset_sem) for GPU 0 -> ...-> ...->
> > down_write(adev->reset_sem) for GPU 0
>
> Yeah, I guessed correctly what you're doing. My recommendation to use 
> down_write_nest_lock still stands. This is for reset only; a kernel-wide 
> reset lock really shouldn't hurt. Or make it a lock per xgmi hive, I'm 
> assuming that information is known somewhere.
>
> The problem with manual lockdep annotations is that they increase complexity. 
> You have to keep all the annotations in mind, including justifications and 
> which parts they still catch and which parts they don't catch. And there's 
> zero performance justification for a gpu reset path to create some fancy 
> lockdep special cases.
>
> Locking needs to be
> 1. maintainable, i.e. every time you need to write a multi-paragraph 
> explanation like the above it's probably not. This obviously includes 
> correctness, but it's even more important that people can easily understand 
> what you're doing.
> 2. fast enough, where it matters. gpu reset just doesn't.
>
> [Dennis Li] Yeah. I strongly agree that manual lockdep annotation will 
> increase complexity. However my patch isn't a lockdep annotation; in fact it 
> fixes a bug. According to the design of lockdep, every lock object should 
> have a separate lock_class_key, which is the unique ID of the lock object 
> in lockdep.

Nope, that's not how lockdep works. It intentionally shares the lockdep class. 
If every lock had its own key, all lockdep would achieve is producing a bit 
more heat and slowing down your CPU.

Yes, it uses the macro trick to make sure you share the lockdep key the right 
way in 99% of all cases, but there are also many other examples with manually 
shared lockdep keys. Shared lockdep keys are the entire point of lockdep; it 
doesn't do much that's useful without them.
-Daniel

[Dennis Li] Yes. I agree that using a shared lockdep class gives better CPU 
performance when lockdep is enabled. And 99% of all cases work well, but our 
case belongs to the 1%. You also agree that down_write_nest_lock has side 
effects and will increase complexity. Comparing the two methods, I still think 
using a separate lock_class_key is the best choice, because it has no side 
effect and the performance drop is very small.

> Let me take the previous sample code as an example again: we define two 
> rw_semaphores, a and b, and use Case 1 and Case 2 to initialize them. In my 
> opinion, Case 1 and Case 2 should have the same behavior, but in fact they 
> differ, because in Case 2 the single init_rwsem expansion inside the wrapper 
> makes a and b share one static key. Therefore we should use lockdep_set_class 
> to fix this issue, as in Case 3.
>
> Case 1:
> init_rwsem(&a);
> init_rwsem(&b);
>
> Case 2:
> void init_rwsem_ext(struct rw_semaphore *sem)
> {
>      init_rwsem(sem);   /* one static __key shared by every caller */
> }
> init_rwsem_ext(&a);
> init_rwsem_ext(&b);
>
> Case 3:
> static struct lock_class_key a_key, b_key;
>
> void init_rwsem_ext(struct rw_semaphore *sem, struct lock_class_key *key)
> {
>      init_rwsem(sem);
>      lockdep_set_class(sem, key);   /* give each semaphore its own class */
> }
> init_rwsem_ext(&a, &a_key);
> init_rwsem_ext(&b, &b_key);
>
> > [  584.110304] ============================================
> > [  584.110590] WARNING: possible recursive locking detected
> > [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> > [  584.111164] --------------------------------------------
> > [  584.111456] kworker/38:1/553 is trying to acquire lock:
> > [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> > [  584.112112]                but task is already holding lock:
> > [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
> > [  584.113068]                other info that might help us debug this:
> > [  584.113689]  Possible unsafe locking scenario:
> >
> > [  584.114350]        CPU0
> > [  584.114685]        ----
> > [  584.115014]   lock(&adev->reset_sem);
> > [  584.115349]   lock(&adev->reset_sem);
> > [  584.115678]
> >                 *** DEADLOCK ***
> >
> > [  584.116624]  May be due to missing lock nesting notation
> >
> > [  584.117284] 4 locks held by kworker/38:1/553:
> > [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
> > [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
> > [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
> > [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
>
> Uh, so this means this rwsem is in the critical path for dma-fence 
> signalling. Please make sure this is all correctly primed at driver load 
> (like we do already for dma_resv/fence and drm_modeset_lock) when lockdep is 
> enabled. Otherwise pretty much guaranteed you'll get this wrong somewhere.
> -Daniel
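>
> A minimal sketch of such priming at device init time (hedged: this assumes 
> the dma-fence signalling annotations dma_fence_begin_signalling()/ 
> dma_fence_end_signalling() are available; it is not the actual amdgpu code):
>
>     if (IS_ENABLED(CONFIG_LOCKDEP)) {
>         bool cookie;
>
>         /* record once, at load, that fences may be signalled while
>          * reset_sem is write-held, so later inversions are flagged */
>         down_write(&adev->reset_sem);
>         cookie = dma_fence_begin_signalling();
>         dma_fence_end_signalling(cookie);
>         up_write(&adev->reset_sem);
>     }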
>
> >
> > Best Regards
> > Dennis Li
> > -----Original Message-----
> > From: Daniel Vetter <dan...@ffwll.ch>
> > Sent: Friday, August 7, 2020 5:45 PM
> > To: Koenig, Christian <christian.koe...@amd.com>
> > Cc: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; 
> > Deucher, Alexander <alexander.deuc...@amd.com>; Kuehling, Felix 
> > <felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
> > Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive 
> > locking
> >
> > On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koe...@amd.com> 
> > wrote:
> > >
> > > Am 07.08.20 um 11:23 schrieb Daniel Vetter:
> > > > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <dan...@ffwll.ch> wrote:
> > > >> On Fri, Aug 7, 2020 at 11:08 AM Christian König 
> > > >> <christian.koe...@amd.com> wrote:
> > > >>> [SNIP]
> > > >>>>>>> What we should do instead is to make sure we have only a single 
> > > >>>>>>> lock for the complete hive.
> > > >>>>>>> [Dennis Li] If we use a single lock, users must wait for all 
> > > >>>>>>> devices to resume successfully, but in fact their tasks are only 
> > > >>>>>>> running on device A. It is not friendly to users.
> > > >>>>>> Well that is intentional I would say. We can only restart 
> > > >>>>>> submissions after all devices are resumed successfully, because 
> > > >>>>>> otherwise it can happen that a job on device A depends on memory 
> > > >>>>>> on device B which isn't accessible.
> > > >>>>>> [Dennis Li] Yes, you are right. The driver has to make sure that 
> > > >>>>>> the shared resources (such as the shared memory) are ready before 
> > > >>>>>> unlocking the adevs one by one. But each device still has private 
> > > >>>>>> hardware resources such as video and display.
> > > >>>>> Yeah, but those are negligible and we have a team working on 
> > > >>>>> display support for XGMI. So this will sooner or later become a 
> > > >>>>> problem as well.
> > > >>>>>
> > > >>>>> I still think that a single rwsem for the whole hive is the 
> > > >>>>> best option here.
> > > >>>>>
> > > >>>>> [Dennis Li] For your above concern, we can open a new thread to 
> > > >>>>> discuss it. If we decide to use a single lock after that discussion, 
> > > >>>>> we can create another patch for it.
> > > >>>> That's a really good argument, but I still hesitate to merge this 
> > > >>>> patch.
> > > >>>> How severe is the lockdep splat?
> > > >>>>
> > > >>>> At bare minimum we need a "/* TODO: ...." comment explaining why we 
> > > >>>> do this and how to remove the workaround again.
> > > >>>> [Dennis Li] It is not a workaround. According to the design of 
> > > >>>> lockdep, each rw_semaphore should have a separate lock_class_key. I 
> > > >>>> have explained that init_rwsem has the above limitation, so we must 
> > > >>>> correct it.
> > > >>> Yeah, but this explanation is not really sufficient. Again, 
> > > >>> this is not a limitation of init_rwsem, but intentional behavior.
> > > >>>
> > > >>> In other words, in most cases lockdep should splat if you use 
> > > >>> rw semaphores like this.
> > > >>>
> > > >>> That we avoid this by always locking multiple amdgpu devices in 
> > > >>> the same order is rather questionable driver design.
> > > >> Yeah don't give multiple locking classes like that, that's 
> > > >> fairly terrible.
> > >
> > > Ok, that is exactly the answer I feared we would get.
> > >
> > > >> The right fix is mutex_lock_nest_lock, which we're using in 
> > > >> ww_mutex and which is also used in mm_take_all_locks.
> > >
> > > Ah! Enlightenment! So in this case we should use 
> > > down_write_nested() in this special case and all is well?
> >
> > Nope. There are two kinds of nesting that lockdep supports:
> > 1. _nested() variants, where you supply an integer parameter specifying the 
> > lockdep subclass. That's like setting a locking subclass at lock creation 
> > time, except much worse because you change the lockdep class at runtime. 
> > This can lead to stuff like:
> >
> > struct mutex A;
> > mutex_init(&A);
> > mutex_lock(&A);
> > mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);
> >
> > which is a very clear deadlock, but lockdep will think it's all fine.
> > Don't use that.
> >
> > 2. _nest_lock() variants, where you specify a super-lock, which makes sure 
> > that all sub-locks cannot be acquired concurrently (or you promise there's 
> > some other trick to avoid deadlocks). Lockdep checks that you've acquired 
> > the super-lock, and then allows arbitrary nesting.
> > This is what ww_mutex uses, and what mm_take_all_locks uses. If the 
> > super-lock is an actual mutex, then this is actually deadlock free.
> > With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we 
> > guarantee deadlock-freeness through the ww algorithm. The fake super lock 
> > for ww_mutex is acquired when you get the ticket.
> >
> > No. 2 is the thing you want, not no. 1.
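> >
> > A minimal sketch of that pattern for the hive reset (hedged: the name 
> > xgmi_hive_lock and the loop pseudocode are illustrative, not the actual 
> > amdgpu code):
> >
> >     static DEFINE_MUTEX(xgmi_hive_lock);    /* the "super-lock" */
> >
> >     mutex_lock(&xgmi_hive_lock);
> >     for each adev in the hive {
> >         /* lockdep allows the nesting because xgmi_hive_lock is held */
> >         down_write_nest_lock(&adev->reset_sem, &xgmi_hive_lock);
> >     }
> >
> >     do_hive_recovery(hive);
> >
> >     for each adev in the hive {
> >         up_write(&adev->reset_sem);
> >     }
> >     mutex_unlock(&xgmi_hive_lock);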
> >
> > > >>
> > > >> If you solve the locking ordering by sorting all the locks you 
> > > >> need (ugh), then you can also just use a fake lockdep_map for 
> > > >> your special function only.
> > > >>
> > > >> Also in general the idea of an rwsem in conjunction with xgmi 
> > > >> cross device memory handling just plain freaks me out :-) 
> > > >> Please make sure you prime this lock against dma_resv and 
> > > >> dma_fence (and maybe also drm_modeset_lock if this lock is 
> > > >> outside of dma_resv), and ideally this should only ever be used 
> > > >> for setup stuff when loading drivers. I guess that's why you 
> > > >> went with a per-device rwsem. If it's really only for setup 
> > > >> then just do the simple thing of having an 
> > > >> xgmi_hive_reconfigure lock, which all the rwsems nest within, and no 
> > > >> fancy lock ordering or other tricks.
> > >
> > > Well this is for the group reset of all devices in a hive, and you 
> > > need to reboot to change the order the devices are in the list.
> >
> > Ok, that sounds safe enough. I'd go with an xgmi_hive_lock then, and 
> > down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock); Taking the 
> > global lock for device reset shouldn't be a problem, we're not going 
> > to reset multiple devices in parallel anyway. Or at least I hope that 
> > use-case isn't a perf-critical benchmark for you guys :-)
> >
> > Cheers, Daniel
> >
> > > Regards,
> > > Christian.
> > >
> > > >>
> > > >>>>    The core network driver (net/core/dev.c) has a similar use case 
> > > >>>> to ours.
> > > >>> Just took a look at that, but this is a completely different use 
> > > >>> case. The lockdep classes are grouped by driver type to note 
> > > >>> the difference in the network stack.
> > > >>>
> > > >>> A quick search showed that no other device driver is doing 
> > > >>> this in the kernel, so I'm not sure if such a behavior 
> > > >>> wouldn't be rejected. Adding Daniel to comment on this as well.
> > > >>>
> > > >>> Felix had some really good arguments to make that an urgent 
> > > >>> fix, so either we come up with some real rework of this or at 
> > > >>> least add a big
> > > >>> "/* TODO....*/".
> > > >> Aside from "I have questions" I don't think there's any reason 
> > > >> not to fix this correctly with mutex_lock_nest_lock. Really 
> > > >> shouldn't be a semantic change from what I understand here.
> > > > Also one more with my drm maintainer hat on, as a heads-up:
> > > > Dave&me looked at i915 gem code a bit too much, and we're 
> > > > appalled at the massive over-use of lockdep class hacks and 
> > > > dubious nesting annotations. Expect crack down on anyone who 
> > > > tries to play clever tricks here, we have a few too many already 
> > > > :-)
> > > >
> > > > So no mutex_lock_nested (totally different beast from 
> > > > mutex_lock_nest_lock, and really unsafe since it's a runtime 
> > > > change of the lockdep class) and also not any lockdep_set_class 
> > > > trickery or anything like that. We need locking hierarchies 
> > > > which mere humans can comprehend and review.
> > > >
> > > > Cheers, Daniel
> > > >
> > > >> Cheers, Daniel
> > > >>
> > > >>> Regards,
> > > >>> Christian.
> > > >>>
> > > >>>> Regards,
> > > >>>> Christian.
> > > >>>>
> > > >>
> > > >> --
> > > >> Daniel Vetter
> > > >> Software Engineer, Intel Corporation 
> > > >> http://blog.ffwll.ch/
> > > >
> > > >
> > >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation 
> > http://blog.ffwll.ch/
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch/



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/