[AMD Public Use]

Hi Daniel,
      Thanks for your review, and I understand your concern. I guess you 
missed the description of this issue, so I am pasting it again below and 
explaining why it happens.

      For example, consider an XGMI system with two GPU devices, each 
represented by a device entity named adev, and each adev has its own 
reset_sem.
      // device init function
      void device_init(adev) {
          ...
          init_rwsem(&adev->reset_sem);
          ...
      }

     An XGMI system has two GPU devices, so the system calls device_init 
twice. However, the init_rwsem macro uses a local static lock_class_key to 
initialize the rw_semaphore, which causes every adev->reset_sem initialized 
at this call site to share the same lock_class_key.

     #define init_rwsem(sem)                                \
     do {                                                   \
             static struct lock_class_key __key;            \
                                                            \
             __init_rwsem((sem), #sem, &__key);             \
     } while (0)

     When a GPU hangs, we have to do GPU recovery for all devices in the 
hive. Therefore we lock each adev->reset_sem to protect the GPUs from being 
accessed by other threads during recovery. The detailed recovery sequence is 
as follows:
     // Step 1: lock all GPUs first
     for each adev in the hive (GPU0 and GPU1) {
           down_write(&adev->reset_sem);
           do_suspend(adev);
     }

     // Step 2: recover the whole hive
     do_hive_recovery(hive);

     // Step 3: resume and unlock the GPUs
     for each adev in the hive (GPU0 and GPU1) {
           do_resume(adev);
           up_write(&adev->reset_sem);
     }

    Because every adev->reset_sem shares the same lock_class_key, lockdep 
treats them all as the same lock class. Therefore, in step 1, when GPU1 is 
locked while GPU0's reset_sem is already held, lockdep emits the warning 
pasted below.

    I considered your proposal (using _nest_lock()) before. Just as you said, 
_nest_lock() would hide true-positive recursive locking, so I gave it up in 
the end.

    After reviewing the lockdep code, I found that lockdep supports dynamic 
keys, so using a separate lock_class_key per device has no side effect. In 
fact, init_rwsem also uses a dynamic key. Please see the call sequences of 
init_rwsem and lockdep_set_class below:
   1) init_rwsem -> __init_rwsem -> lockdep_init_map;
   2) lockdep_set_class -> lockdep_init_map;
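
    To make it concrete, a minimal sketch of the per-device key idea looks 
like the below. The field and function names are illustrative, and I am using 
lockdep_register_key()/lockdep_set_class() only to show the mechanism; it is 
not necessarily the exact code in the patch:

    #include <linux/lockdep.h>
    #include <linux/rwsem.h>

    struct gpu_device {
            struct rw_semaphore reset_sem;
            struct lock_class_key reset_sem_key;  /* one key per device */
    };

    static void gpu_device_init(struct gpu_device *adev)
    {
            init_rwsem(&adev->reset_sem);
            /* The key lives in dynamically allocated memory, so register
             * it as a dynamic key before handing it to lockdep. */
            lockdep_register_key(&adev->reset_sem_key);
            /* Replace the per-call-site static __key from init_rwsem()
             * with this device's own key, so each reset_sem gets its own
             * lockdep class. */
            lockdep_set_class(&adev->reset_sem, &adev->reset_sem_key);
    }

    With a separate class per device, holding both GPUs' reset_sems in one 
thread no longer looks like recursion to lockdep. (A matching 
lockdep_unregister_key() would be needed on device teardown.)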

    Finally, back to your concern: you may worry that this change will make 
the deadlock below undetectable. In fact, lockdep is still able to detect the 
issue as a circular locking dependency; there is just no "recursive locking" 
warning in this case.
    Thread A: down_write(adev->reset_sem) for GPU 0 -> 
down_write(adev->reset_sem) for GPU 1 -> ... -> up_write(adev->reset_sem) for 
GPU 1 -> up_write(adev->reset_sem) for GPU 0
    Thread B: down_write(adev->reset_sem) for GPU 1 -> 
down_write(adev->reset_sem) for GPU 0 -> ... -> up_write(adev->reset_sem) for 
GPU 0 -> up_write(adev->reset_sem) for GPU 1

    And lockdep can still detect genuine recursive locking in a case like 
this:
    Thread A: down_write(adev->reset_sem) for GPU 0 -> ... -> ... -> 
down_write(adev->reset_sem) for GPU 0

[  584.110304] ============================================
[  584.110590] WARNING: possible recursive locking detected
[  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  584.111164] --------------------------------------------
[  584.111456] kworker/38:1/553 is trying to acquire lock:
[  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.112112]
               but task is already holding lock:
[  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
[  584.113068]
               other info that might help us debug this:
[  584.113689]  Possible unsafe locking scenario:

[  584.114350]        CPU0
[  584.114685]        ----
[  584.115014]   lock(&adev->reset_sem);
[  584.115349]   lock(&adev->reset_sem);
[  584.115678]
                *** DEADLOCK ***

[  584.116624]  May be due to missing lock nesting notation

[  584.117284] 4 locks held by kworker/38:1/553:
[  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: 
process_one_work+0x21f/0x630
[  584.117967]  #1: ffffac708e1c3e58 
((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
[  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: 
amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
[  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: 
amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]

Best Regards
Dennis Li
-----Original Message-----
From: Daniel Vetter <dan...@ffwll.ch> 
Sent: Friday, August 7, 2020 5:45 PM
To: Koenig, Christian <christian.koe...@amd.com>
Cc: Li, Dennis <dennis...@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander <alexander.deuc...@amd.com>; Kuehling, Felix 
<felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH] drm/amdgpu: annotate a false positive recursive locking

On Fri, Aug 7, 2020 at 11:34 AM Christian König <christian.koe...@amd.com> 
wrote:
>
> Am 07.08.20 um 11:23 schrieb Daniel Vetter:
> > On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <dan...@ffwll.ch> wrote:
> >> On Fri, Aug 7, 2020 at 11:08 AM Christian König 
> >> <christian.koe...@amd.com> wrote:
> >>> [SNIP]
> >>>>>>> What we should do instead is to make sure we have only a single lock 
> >>>>>>> for the complete hive.
> >>>>>>> [Dennis Li] If we use a single lock, users must wait for all 
> >>>>>>> devices to resume successfully, but in fact their tasks may only be 
> >>>>>>> running on device A. It is not friendly to users.
> >>>>>> Well that is intentional I would say. We can only restart submissions 
> >>>>>> after all devices are resumed successfully, because otherwise it can 
> >>>>>> happen that a job on device A depends on memory on device B which 
> >>>>>> isn't accessible.
> >>>>>> [Dennis Li] Yes, you are right. The driver has to make sure that the 
> >>>>>> shared resources (such as the shared memory) are ready before 
> >>>>>> unlocking the adevs one by one. But each device still has private 
> >>>>>> hardware resources such as video and display.
> >>>>> Yeah, but those are negligible and we have a team working on display 
> >>>>> support for XGMI. So this will sooner or later become a problem as well.
> >>>>>
> >>>>> I still think that a single rwsem for the whole hive is the best 
> >>>>> option here.
> >>>>>
> >>>>> [Dennis Li] For your above concern, we can open a new thread to discuss 
> >>>>> it. If we decide to use a single lock after the discussion, we can 
> >>>>> create another patch for it.
> >>>> That's a really good argument, but I still hesitate to merge this patch.
> >>>> How severe is the lockdep splat?
> >>>>
> >>>> At bare minimum we need a "/* TODO: ...." comment why we do this and how 
> >>>> to remove the workaround again.
> >>>> [Dennis Li] It is not a workaround. According to the design of lockdep, 
> >>>> each rw_semaphore should have a separate lock_class_key. I have 
> >>>> explained that init_rwsem has the above limitation, so we must correct it.
> >>> Yeah, but this explanation is not really sufficient. Again this is 
> >>> not a limitation of init_rwsem, but intentional behavior.
> >>>
> >>> In other words in most cases lockdep should splat if you use rw 
> >>> semaphores like this.
> >>>
> >>> That we avoid this by always locking multiple amdgpu devices in the 
> >>> same order is rather questionable driver design.
> >> Yeah don't give multiple locking classes like that, that's fairly 
> >> terrible.
>
> Ok, that is exactly the answer I feared we would get.
>
> >> The right fix is mutex_lock_nest_lock, which we're using in 
> >> ww_mutex and which is also used in mm_take_all_locks.
>
> Ah! Enlightenment! So in this case we should use down_write_nested() 
> in this special space and all is well?

Nope. There's two kinds of nesting that lockdep supports:
1. _nested() variants, where you supply an integer parameter specifying the 
lockdep subclass. That's like setting a locking subclass at lock creation time, 
except much worse because you change the lockdep class at runtime. This can 
lead to stuff like:

struct mutex A;
mutex_init(&A);
mutex_lock(&A);
mutex_lock_nested(&A, SINGLE_DEPTH_NESTING);

which is a very clear deadlock, but lockdep will think it's all fine.
Don't use that.

2. _nest_lock() variants, where you specify a super-lock, which makes sure that 
all sub-locks cannot be acquired concurrently (or you promise there's some 
other trick to avoid deadlocks). Lockdep checks that you've acquired the 
super-lock, and then allows arbitrary nesting.
This is what ww_mutex uses, and what mm_take_all_locks uses. If the super-lock 
is an actual mutex, then this is actually deadlock free.
With ww_mutex it's a fake lockdep_map in the ww_acquire_ctx, but we guarantee 
deadlock-freeness through the ww algorithm. The fake super lock for ww_mutex is 
acquired when you get the ticket.

No 2 is the thing you want, not no 1.

> >>
> >> If you solve the locking ordering by sorting all the locks you need 
> >> (ugh), then you can also just use a fake lockdep_map for your 
> >> special function only.
> >>
> >> Also in general the idea of an rwsem in conjunction with xgmi cross 
> >> device memory handling just plain freaks me out :-) Please make 
> >> sure you prime this lock against dma_resv and dma_fence (and maybe 
> >> also drm_modeset_lock if this lock is outside of dma_resv), and 
> >> ideally this should only ever be used for setup stuff when loading 
> >> drivers. I guess that's why you went with a per-device rwsem. If 
> >> it's really only for setup then just do the simple thing of having 
> >> an xgmi_hive_reconfigure lock, which all the rwsems nest within, 
> >> and no fancy lock ordering or other tricks.
>
> Well this is for the group reset of all devices in a hive and you need 
> to reboot to change the order the device are in the list.

Ok, that sounds safe enough. I'd go with a xgmi_hive_lock then, and 
down_write_nest_lock(&adev->xgmi_rwsem, &xgmi_hive_lock); Taking the global 
lock for device reset shouldn't be a problem, we're not going to reset multiple 
devices in parallel anyway. Or at least I don't hope that use-case is a perf 
critical benchmark for you guys :-)

Cheers, Daniel

> Regards,
> Christian.
>
> >>
> >>>>    The core network driver (net/core/dev.c) has a similar use case to ours.
> >>> Just took a look at that, but this is completely different use 
> >>> case. The lockdep classes are grouped by driver type to note the 
> >>> difference in the network stack.
> >>>
> >>> A quick search showed that no other device driver is doing this in 
> >>> the kernel, so I'm not sure if such a behavior wouldn't be 
> >>> rejected. Adding Daniel to comment on this as well.
> >>>
> >>> Felix had some really good arguments to make that an urgent fix, 
> >>> so either we come up with some real rework of this or at least add 
> >>> a big
> >>> "/* TODO....*/".
> >> Aside from "I have questions" I don't think there's any reason to 
> >> not fix this correctly with mutex_lock_nest_lock. Really shouldn't 
> >> be a semantic change from what I understand here.
> > Also one more with my drm maintainer hat on, as a heads-up: Dave&me 
> > looked at i915 gem code a bit too much, and we're appalled at the 
> > massive over-use of lockdep class hacks and dubious nesting 
> > annotations. Expect crack down on anyone who tries to play clever 
> > tricks here, we have a few too many already :-)
> >
> > So no mutex_lock_nested (totally different beast from 
> > mutex_lock_nest_lock, and really unsafe since it's a runtime change 
> > of the lockdep class) and also not any lockdep_set_class trickery or 
> > anything like that. We need locking hierarchies which mere humans 
> > can comprehend and review.
> >
> > Cheers, Daniel
> >
> >> Cheers, Daniel
> >>
> >>> Regards,
> >>> Christian.
> >>>
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>
> >> --
> >> Daniel Vetter
> >> Software Engineer, Intel Corporation
> >> http://blog.ffwll.ch/
> >
> >
>


--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
