On Mon, May 11, 2026 at 12:37 AM Boris Brezillon
<[email protected]> wrote:
>
> Hi,
>
> On Sat, 9 May 2026 08:34:15 -0700
> Rob Clark <[email protected]> wrote:
>
> > On Thu, May 7, 2026 at 11:57 PM Daniel J Blueman <[email protected]> wrote:
> > >
> > > With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see:
> > >
> > > """
> > > kswapd0/121 is trying to acquire lock:
> > > ffff800080ed3800 (reservation_ww_class_acquire){+.+.}-{0:0}, at:
> > >   msm_gem_shrinker_scan (drivers/gpu/drm/msm/msm_gem_shrinker.c:189)
> > >
> > > but task is already holding lock:
> > > ffffbf4ddb44ca40 (fs_reclaim){+.+.}-{0:0}, at:
> > >   balance_pgdat (mm/vmscan.c:7236 (discriminator 2))
> > >
> > > which lock already depends on the new lock.
> > >
> > > the existing dependency chain (in reverse order) is:
> > >
> > > -> #2 (fs_reclaim){+.+.}-{0:0}:
> > > lock_acquire (kernel/locking/lockdep.c:5868 kernel/locking/lockdep.c:5825)
> > > fs_reclaim_acquire (mm/page_alloc.c:4325 mm/page_alloc.c:4339)
> > > dma_resv_lockdep (drivers/dma-buf/dma-resv.c:798)
> > > do_one_initcall (init/main.c:1392)
> > > kernel_init_freeable (init/main.c:1454 (discriminator 1) init/main.c:1470
> > >   (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703
> > >   (discriminator 1))
> > > kernel_init (init/main.c:1593)
> > > ret_from_fork (arch/arm64/kernel/entry.S:858)
> > >
> > > -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}:
> > > lock_acquire (kernel/locking/lockdep.c:5868 kernel/locking/lockdep.c:5825)
> > > dma_resv_lockdep (./include/linux/ww_mutex.h:164 (discriminator 1)
> > >   drivers/dma-buf/dma-resv.c:791 (discriminator 1))
> > > do_one_initcall (init/main.c:1392)
> > > kernel_init_freeable (init/main.c:1454 (discriminator 1) init/main.c:1470
> > >   (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703
> > >   (discriminator 1))
> > > kernel_init (init/main.c:1593)
> > > ret_from_fork (arch/arm64/kernel/entry.S:858)
> > >
> > > -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}:
> > > check_prev_add (kernel/locking/lockdep.c:3165)
> > > __lock_acquire (kernel/locking/lockdep.c:3284
> > >   kernel/locking/lockdep.c:3908 kernel/locking/lockdep.c:5237)
> > > lock_acquire (kernel/locking/lockdep.c:5868 kernel/locking/lockdep.c:5825)
> > > drm_gem_lru_scan (./include/linux/ww_mutex.h:163 (discriminator 1)
> > >   drivers/gpu/drm/drm_gem.c:1681 (discriminator 1))
> >
> > Your line #s don't quite match mine, but AFAICT this is from the
> > ww_acquire_init()
> >
> > What I'm unsure about is if this could cause live-lock against another
> > operation which requires obtaining both obj and vm locks in a
> > potentially different order (which would also be using a
> > ww_acquire_ctx ticket to backoff in case of conflicting locking
> > order).  It wouldn't deadlock because we don't sleep forever if we do
> > sleep, but...
> >
> > Possibly we should also be using trylock to also acquire the vm lock,
> > but lockdep would still complain as it doesn't know the ticket will be
> > only used w/ trylock (unless we did something hacky by using a
> > different ww_class?)
>
> FWIW, we started using a ticket in the initial version of the Panthor
> shrinker, and ditched it at some point because of these unsolvable lock
> ordering issues. It also seems to me that trylock-all-the-way is the
> right solution, and if we trylock and back off + immediately move to
> the next BO if any lock is already held, the ticket approach is not as
> useful, because we're not going to use the retry mechanism provided by
> ww_mutex anyway.
>
> It's true that it does the bookkeeping, which simplifies the rollback
> procedure, but if you look at the other locks taken in the shrinker
> path, they are static (one per-component involved in reclaim) for most
> of them, meaning the rollback is pretty straightforward. The only
> exception is the VM lock (one per vm_bo in case of shared BOs). In
> panthor, we just decided to open-code this rollback logic (see
> panthor_gem_try_evict_no_resv_wait() [1]) instead of teaching ww_mutex
> about non-blocking locks when a ticket is provided. Not saying this is
> the best option, but it works...

hmm, ok.. and if we block waiting for BO that is before grabbing the
vm locks so no live-lock concern (sry, been a while since I've looked
at this)..  So I guess trylock is maybe the better approach.  In this
case we should drop the ticket arg (basically revert commit
02070f049875 ("drm/gem: Add ww_acquire_ctx support to
drm_gem_lru_scan()"))

BR,
-R

> Regards,
>
> Boris
>
> [1]https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/drivers/gpu/drm/panthor/panthor_gem.c?ref_type=heads#L1425

Reply via email to