On Mon, May 11, 2026 at 6:16 AM Rob Clark <[email protected]> wrote:
>
> On Mon, May 11, 2026 at 12:37 AM Boris Brezillon
> <[email protected]> wrote:
> >
> > Hi,
> >
> > On Sat, 9 May 2026 08:34:15 -0700
> > Rob Clark <[email protected]> wrote:
> >
> > > On Thu, May 7, 2026 at 11:57 PM Daniel J Blueman <[email protected]> wrote:
> > > >
> > > > With PROVE_LOCKING on an Snapdragon X1 and VM reclaim pressure, we see:
> > > >
> > > > """
> > > > kswapd0/121 is trying to acquire lock:
> > > > ffff800080ed3800 (reservation_ww_class_acquire){+.+.}-{0:0}, at:
> > > >   msm_gem_shrinker_scan (drivers/gpu/drm/msm/msm_gem_shrinker.c:189)
> > > >
> > > > but task is already holding lock:
> > > > ffffbf4ddb44ca40 (fs_reclaim){+.+.}-{0:0}, at:
> > > >   balance_pgdat (mm/vmscan.c:7236 (discriminator 2))
> > > >
> > > > which lock already depends on the new lock.
> > > >
> > > > the existing dependency chain (in reverse order) is:
> > > >
> > > > -> #2 (fs_reclaim){+.+.}-{0:0}:
> > > > lock_acquire (kernel/locking/lockdep.c:5868 
> > > > kernel/locking/lockdep.c:5825)
> > > > fs_reclaim_acquire (mm/page_alloc.c:4325 mm/page_alloc.c:4339)
> > > > dma_resv_lockdep (drivers/dma-buf/dma-resv.c:798)
> > > > do_one_initcall (init/main.c:1392)
> > > > kernel_init_freeable (init/main.c:1454 (discriminator 1) 
> > > > init/main.c:1470
> > > >   (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703
> > > >   (discriminator 1))
> > > > kernel_init (init/main.c:1593)
> > > > ret_from_fork (arch/arm64/kernel/entry.S:858)
> > > >
> > > > -> #1 (reservation_ww_class_mutex){+.+.}-{4:4}:
> > > > lock_acquire (kernel/locking/lockdep.c:5868 
> > > > kernel/locking/lockdep.c:5825)
> > > > dma_resv_lockdep (./include/linux/ww_mutex.h:164 (discriminator 1)
> > > >   drivers/dma-buf/dma-resv.c:791 (discriminator 1))
> > > > do_one_initcall (init/main.c:1392)
> > > > kernel_init_freeable (init/main.c:1454 (discriminator 1) 
> > > > init/main.c:1470
> > > >   (discriminator 1) init/main.c:1490 (discriminator 1) init/main.c:1703
> > > >   (discriminator 1))
> > > > kernel_init (init/main.c:1593)
> > > > ret_from_fork (arch/arm64/kernel/entry.S:858)
> > > >
> > > > -> #0 (reservation_ww_class_acquire){+.+.}-{0:0}:
> > > > check_prev_add (kernel/locking/lockdep.c:3165)
> > > > __lock_acquire (kernel/locking/lockdep.c:3284
> > > >   kernel/locking/lockdep.c:3908 kernel/locking/lockdep.c:5237)
> > > > lock_acquire (kernel/locking/lockdep.c:5868 
> > > > kernel/locking/lockdep.c:5825)
> > > > drm_gem_lru_scan (./include/linux/ww_mutex.h:163 (discriminator 1)
> > > >   drivers/gpu/drm/drm_gem.c:1681 (discriminator 1))
> > >
> > > Your line #s don't quite match mine, but AFAICT this is from the
> > > ww_acquire_init()
> > >
> > > What I'm unsure about is if this could cause live-lock against another
> > > operation which requires obtaining both obj and vm locks in a
> > > potentially different order (which would also be using a
> > > ww_acquire_ctx ticket to backoff in case of conflicting locking
> > > order).  It wouldn't deadlock because we don't sleep forever if we do
> > > sleep, but...
> > >
> > > Possibly we should also be using trylock to also acquire the vm lock,
> > > but lockdep would still complain as it doesn't know the ticket will be
> > > only used w/ trylock (unless we did something hacky by using a
> > > different ww_class?)
> >
> > FWIW, we started using a ticket in the initial version of the Panthor
> > shrinker, and ditched it at some point because of these unsolvable lock
> > ordering issues. It also seems to me that trylock-all-the-way is the
> > right solution, and if we trylock and back off + immediately move to
> > the next BO if any lock is already held, the ticket approach is not as
> > useful, because we're not going to use the retry mechanism provided by
> > ww_mutex anyway.
> >
> > It's true that it does the bookkeeping, which simplifies the rollback
> > procedure, but if you look at the other locks taken in the shrinker
> > path, they are static (one per-component involved in reclaim) for most
> > of them, meaning the rollback is pretty straightforward. The only
> > exception is the VM lock (one per vm_bo in case of shared BOs). In
> > panthor, we just decided to open-code this rollback logic (see
> > panthor_gem_try_evict_no_resv_wait() [1]) instead of teaching ww_mutex
> > about non-blocking locks when a ticket is provided. Not saying this is
> > the best option, but it works...
>
> hmm, ok.. and if we block waiting for BO that is before grabbing the
> vm locks so no live-lock concern (sry, been a while since I've looked
> at this)..  So I guess trylock is maybe the better approach.  In this
> case we should drop the ticket arg (basically revert commit
> 02070f049875 ("drm/gem: Add ww_acquire_ctx support to
> drm_gem_lru_scan()"))

I'll pull this into msm-fixes.  I've fixed a compile error with the
original patch (and replaced the lockdep splat with something
legible... somehow whitespace was swallowed with the original).

For next cycle I'll send a cleanup to drop the now unused ticket arg
for drm_gem_lru_scan()

BR,
-R

> BR,
> -R
>
> > Regards,
> >
> > Boris
> >
> > [1]https://gitlab.freedesktop.org/drm/misc/kernel/-/blob/drm-misc-next/drivers/gpu/drm/panthor/panthor_gem.c?ref_type=heads#L1425

Reply via email to