On Mon, May 11, 2026 at 6:16 AM Rob Clark <[email protected]> wrote: > > On Mon, May 11, 2026 at 12:37 AM Boris Brezillon > <[email protected]> wrote: > > > > Hi, > > > > On Sat, 9 May 2026 08:34:15 -0700 > > Rob Clark <[email protected]> wrote: > > > > > On Thu, May 7, 2026 at 11:57 PM Daniel J Blueman <[email protected]> wrote: > > > > > > > > With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see: > > > > > > > > """ > > > > kswapd0/121 is trying to acquire lock: > > > > ffff800080ed3800 (reservation_ww_class_acquire){+.+.}-{0:0}, at: > > > > msm_gem_shrinker_scan (drivers/gpu/drm/msm/msm_gem_shrinker.c:189) > > > > > > > > but task is already holding lock: > > > > ffffbf4ddb44ca40 (fs_reclaim){+.+.}-{0:0}, at: > > > > balance_pgdat (mm/vmscan.c:7236 (discriminator 2)) > > > > > > > > which lock already depends on the new lock. > > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > lock_acquire (kernel/locking/lockdep.c:5868 > > > > kernel/locking/lockdep.c:5825) > > > > fs_reclaim_acquire (mm/page_alloc.c:4325 mm/page_alloc.c:4339) > > > > dma_resv_lockdep (drivers/dma-buf/dma-resv.c:798) > > > > do_one_initcall (init/main.c:1392) > > > > kernel_init_freeable (init/main.c:1454 (discriminator 1) > > > > init/main.c:1470 > > > > (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703 > > > > (discriminator 1)) > > > > kernel_init (init/main.c:1593) > > > > ret_from_fork (arch/arm64/kernel/entry.S:858) > > > > > > > > -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}: > > > > lock_acquire (kernel/locking/lockdep.c:5868 > > > > kernel/locking/lockdep.c:5825) > > > > dma_resv_lockdep (./include/linux/ww_mutex.h:164 (discriminator 1) > > > > drivers/dma-buf/dma-resv.c:791 (discriminator 1)) > > > > do_one_initcall (init/main.c:1392) > > > > kernel_init_freeable (init/main.c:1454 (discriminator 1) > > > > init/main.c:1470 > > > > (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703 > > > > (discriminator 1)) > > > > kernel_init (init/main.c:1593) > > > > ret_from_fork (arch/arm64/kernel/entry.S:858) > > > > > > > > -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}: > > > > check_prev_add (kernel/locking/lockdep.c:3165) > > > > __lock_acquire (kernel/locking/lockdep.c:3284 > > > > kernel/locking/lockdep.c:3908 kernel/locking/lockdep.c:5237) > > > > lock_acquire (kernel/locking/lockdep.c:5868 > > > > kernel/locking/lockdep.c:5825) > > > > drm_gem_lru_scan (./include/linux/ww_mutex.h:163 (discriminator 1) > > > > drivers/gpu/drm/drm_gem.c:1681 (discriminator 1)) > > > > > > Your line #s don't quite match mine, but AFAICT this is from the > > > ww_acquire_init() > > > > > > What I'm unsure about is if this could cause live-lock against another > > > operation which requires obtaining both obj and vm locks in a > > > potentially different order (which would also be using a > > > ww_acquire_ctx ticket to backoff in case of conflicting locking > > > order). It wouldn't deadlock because we don't sleep forever if we do > > > sleep, but... > > > > > > Possibly we should also be using trylock to also acquire the vm lock, > > > but lockdep would still complain as it doesn't know the ticket will be > > > only used w/ trylock (unless we did something hacky by using a > > > different ww_class?) > > > > FWIW, we started using a ticket in the initial version of the Panthor > > shrinker, and ditched it at some point because of these unsolvable lock > > ordering issues. It also seems to me that trylock-all-the-way is the > > right solution, and if we trylock and back off + immediately move to > > the next BO if any lock is already held, the ticket approach is not as > > useful, because we're not going to use the retry mechanism provided by > > ww_mutex anyway. > > > > It's true that it does the bookkeeping, which simplifies the rollback > > procedure, but if you look at the other locks taken in the shrinker > > path, they are static (one per-component involved in reclaim) for most > > of them, meaning the rollback is pretty straightforward. The only > > exception is the VM lock (one per vm_bo in case of shared BOs). In > > panthor, we just decided to open-code this rollback logic (see > > panthor_gem_try_evict_no_resv_wait() [1]) instead of teaching ww_mutex > > about non-blocking locks when a ticket is provided. Not saying this is > > the best option, but it works... > > hmm, ok.. and if we block waiting for BO that is before grabbing the > vm locks so no live-lock concern (sry, been a while since I've looked > at this).. So I guess trylock is maybe the better approach. In this > case we should drop the ticket arg (basically revert commit > 02070f049875 ("drm/gem: Add ww_acquire_ctx support to > drm_gem_lru_scan()"))
I'll pull this into msm-fixes. I've fixed a compile error with the original patch (and replaced the lockdep splat with something legible... somehow whitespace was swallowed with the original). For next cycle I'll send a cleanup to drop the now unused ticket arg for drm_gem_lru_scan() BR, -R > BR, > -R > > > Regards, > > > > Boris > > > > [1]https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/drivers/gpu/drm/panthor/panthor_gem.c?ref_type=heads#L1425
