On Fri, Aug 7, 2020 at 11:20 AM Daniel Vetter <dan...@ffwll.ch> wrote:
>
> On Fri, Aug 7, 2020 at 11:08 AM Christian König
> <christian.koe...@amd.com> wrote:
> >
> > [SNIP]
> > >>>> What we should do instead is to make sure we have only a single lock
> > >>>> for the complete hive instead.
> > >>>> [Dennis Li] If we use a single lock, users must wait for all
> > >>>> devices to resume successfully, but in fact their tasks are only
> > >>>> running on device A. It is not friendly to users.
> > >>> Well, that is intentional I would say. We can only restart submissions
> > >>> after all devices are resumed successfully, because otherwise it can
> > >>> happen that a job on device A depends on memory on device B which isn't
> > >>> accessible.
> > >>> [Dennis Li] Yes, you are right. The driver has to make sure that the shared
> > >>> resources (such as the shared memory) are ready before unlocking the lock
> > >>> of each adev one by one. But each device still has private hardware
> > >>> resources such as video and display.
> > >> Yeah, but those are negligible and we have a team working on display
> > >> support for XGMI. So this will sooner or later become a problem as well.
> > >>
> > >> I still think that a single rwsem for the whole hive is still the best
> > >> option here.
> > >>
> > >> [Dennis Li] For your above concern, we can open a new thread to discuss
> > >> it. If we decide to use a single lock after discussing it, we can
> > >> create another patch for it.
> > > That's a really good argument, but I still hesitate to merge this patch.
> > > How severe is the lockdep splat?
> > >
> > > At bare minimum we need a "/* TODO: ...." comment explaining why we do
> > > this and how to remove the workaround again.
> > > [Dennis Li] It is not a workaround. According to the design of lockdep,
> > > each rw_semaphore should have a separate lock_class_key. I have explained
> > > that init_rwsem has the above limitation, so we must correct it.
> >
> > Yeah, but this explanation is not really sufficient. Again, this is not
> > a limitation of init_rwsem, but intentional behavior.
> >
> > In other words, in most cases lockdep should splat if you use rw
> > semaphores like this.
> >
> > That we avoid this by always locking multiple amdgpu devices in the same
> > order is rather questionable driver design.
>
> Yeah, don't give out multiple locking classes like that, that's fairly
> terrible. The right fix is mutex_lock_nest_lock, which we're using in
> ww_mutex and which is also used in mm_take_all_locks.
>
> If you solve the lock ordering by sorting all the locks you need
> (ugh), then you can also just use a fake lockdep_map for your special
> function only.
>
> Also, in general the idea of an rwsem in conjunction with xgmi cross
> device memory handling just plain freaks me out :-) Please make sure
> you prime this lock against dma_resv and dma_fence (and maybe also
> drm_modeset_lock if this lock is outside of dma_resv), and ideally
> this should only ever be used for setup stuff when loading drivers. I
> guess that's why you went with a per-device rwsem. If it's really only
> for setup then just do the simple thing of having an
> xgmi_hive_reconfigure lock, which all the rwsems nest within, and no
> fancy lock ordering or other tricks.
> > > Core network driver (net/core/dev.c) has a similar use case to ours.
> >
> > Just took a look at that, but this is a completely different use case. The
> > lockdep classes are grouped by driver type to note the differences in the
> > network stack.
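For readers following the lockdep argument above: the init_rwsem() behaviour
being debated is that the macro allocates a single static lock_class_key per
call site, so every per-device rwsem initialised from the same line of code
shares one lockdep class, and taking several of them at once looks like
recursive locking to lockdep. A rough, untested sketch of the pattern in
question (struct and field names only approximate the amdgpu code, this is
not the actual patch):

	/*
	 * From include/linux/rwsem.h (trimmed): one static key per call
	 * site, hence one lockdep class shared by every rwsem that is
	 * initialised here.
	 *
	 *	#define init_rwsem(sem)					\
	 *	do {							\
	 *		static struct lock_class_key __key;		\
	 *								\
	 *		__init_rwsem((sem), #sem, &__key);		\
	 *	} while (0)
	 */

	/* Hypothetical hive-wide reset path: the second down_write() nests
	 * the lock class inside itself, so lockdep reports possible
	 * recursive locking even though the devices are always locked in a
	 * fixed order.
	 */
	static void hive_lock_all_devices(struct amdgpu_hive_info *hive)
	{
		struct amdgpu_device *adev;

		list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
			down_write(&adev->reset_sem);	/* same class every iteration */
	}
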
> >
> > A quick search showed that no other device driver is doing this in the
> > kernel, so I'm not sure if such a behavior wouldn't be rejected. Adding
> > Daniel to comment on this as well.
> >
> > Felix had some really good arguments to make that an urgent fix, so
> > either we come up with some real rework of this or at least add a big
> > "/* TODO....*/".
>
> Aside from "I have questions" I don't think there's any reason to not
> fix this correctly with mutex_lock_nest_lock. It really shouldn't be a
> semantic change from what I understand here.
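To make that concrete: for rw_semaphores the counterpart of
mutex_lock_nest_lock() is down_write_nest_lock(), which tells lockdep that an
outer lock already serialises everyone who takes the same-class locks. A
minimal, untested sketch of how that could look for the per-device reset
locks, assuming a hive-level mutex such as hive->hive_lock (names are
illustrative, this is not a reviewed patch):

	/* Nest every per-device reset_sem under one outer hive lock.
	 * lockdep then accepts several write-locked rwsems of the same
	 * class, because the outer mutex guarantees only one thread can do
	 * this walk at a time.  Note: this is the nest_lock annotation, not
	 * down_write_nested()/mutex_lock_nested(), which merely switch the
	 * subclass at runtime.
	 */
	static void hive_lock_all_nested(struct amdgpu_hive_info *hive)
	{
		struct amdgpu_device *adev;

		mutex_lock(&hive->hive_lock);	/* the one outer lock */

		list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
			down_write_nest_lock(&adev->reset_sem, &hive->hive_lock);
	}

	static void hive_unlock_all_nested(struct amdgpu_hive_info *hive)
	{
		struct amdgpu_device *adev;

		list_for_each_entry(adev, &hive->device_list, gmc.xgmi.head)
			up_write(&adev->reset_sem);

		mutex_unlock(&hive->hive_lock);
	}
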
Also, one more thing with my drm maintainer hat on, as a heads-up: Dave and I
looked at the i915 gem code a bit too much, and we're appalled at the massive
over-use of lockdep class hacks and dubious nesting annotations. Expect a
crackdown on anyone who tries to play clever tricks here, we have a few too
many already :-)

So no mutex_lock_nested (a totally different beast from
mutex_lock_nest_lock, and really unsafe since it's a runtime change of the
lockdep class) and also no lockdep_set_class trickery or anything like that.
We need locking hierarchies which mere humans can comprehend and review.

Cheers, Daniel

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx