On Thu, 17 Apr 2025, LongPing Wei wrote:

> On 2025/4/16 20:22, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 16 Apr 2025, LongPing Wei wrote:
> > 
> > > A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and
> > > try_verify_in_tasklet are enabled.
> > > 
> > > diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> > > index 9c8ed65cd87e..bdbd79e76f96 100644
> > > --- a/drivers/md/dm-bufio.c
> > > +++ b/drivers/md/dm-bufio.c
> > > @@ -2424,6 +2424,10 @@ static void __scan(struct dm_bufio_client *c)
> > >                           atomic_long_dec(&c->need_shrink);
> > >                           freed++;
> > > +
> > > +                 if (static_branch_unlikely(&no_sleep_enabled) &&
> > > c->no_sleep)
> > > +                         continue;
> > > +
> > >                           cond_resched();
> > >                   }
> > >           }
> > > -- 
> > > 2.34.1
> > 
> > Hi
> > 
> > Thanks for the bug report. I think it is not a good practice in general to
> > hold a spinlock and loop over a list of unbounded size. So, I modified
> > your patch so that it drops the lock after every 16 iterations and calls
> > cond_resched() with the lock dropped.
> > 
> > Please, test this patch and report whether it works for you.
> > 
> > Mikulas
> > 
> > 
> > [PATCH] dm-bufio: don't schedule in atomic context
> > 
> > A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and
> > try_verify_in_tasklet are enabled.
> > [  129.444685][  T934] BUG: sleeping function called from invalid context at
> > drivers/md/dm-bufio.c:2421
> > [  129.444723][  T934] in_atomic(): 1, irqs_disabled(): 0, non_block: 0,
> > pid: 934, name: kworker/1:4
> > [  129.444740][  T934] preempt_count: 201, expected: 0
> > [  129.444756][  T934] RCU nest depth: 0, expected: 0
> > [  129.444781][  T934] Preemption disabled at:
> > [  129.444789][  T934] [<ffffffd816231900>] shrink_work+0x21c/0x248
> > [  129.445167][  T934] kernel BUG at kernel/sched/walt/walt_debug.c:16!
> > [  129.445183][  T934] Internal error: Oops - BUG: 00000000f2000800 [#1]
> > PREEMPT SMP
> > [  129.445204][  T934] Skip md ftrace buffer dump for: 0x1609e0
> > [  129.447348][  T934] CPU: 1 PID: 934 Comm: kworker/1:4 Tainted: G        W
> > OE      6.6.56-android15-8-o-g6f82312b30b9-debug #1
> > 1400000003000000474e5500b3187743670464e8
> > [  129.447362][  T934] Hardware name: Qualcomm Technologies, Inc. Parrot
> > QRD, Alpha-M (DT)
> > [  129.447373][  T934] Workqueue: dm_bufio_cache shrink_work
> > [  129.447394][  T934] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > [  129.447406][  T934] pc : android_rvh_schedule_bug+0x0/0x8
> > [sched_walt_debug]
> > [  129.447435][  T934] lr : __traceiter_android_rvh_schedule_bug+0x44/0x6c
> > [  129.447451][  T934] sp : ffffffc0843dbc90
> > [  129.447459][  T934] x29: ffffffc0843dbc90 x28: ffffffffffffffff x27:
> > 0000000000000c8b
> > [  129.447479][  T934] x26: 0000000000000040 x25: ffffff804b3d6260 x24:
> > ffffffd816232b68
> > [  129.447497][  T934] x23: ffffff805171c5b4 x22: 0000000000000000 x21:
> > ffffffd816231900
> > [  129.447517][  T934] x20: ffffff80306ba898 x19: 0000000000000000 x18:
> > ffffffc084159030
> > [  129.447535][  T934] x17: 00000000d2b5dd1f x16: 00000000d2b5dd1f x15:
> > ffffffd816720358
> > [  129.447554][  T934] x14: 0000000000000004 x13: ffffff89ef978000 x12:
> > 0000000000000003
> > [  129.447572][  T934] x11: ffffffd817a823c4 x10: 0000000000000202 x9 :
> > 7e779c5735de9400
> > [  129.447591][  T934] x8 : ffffffd81560d004 x7 : 205b5d3938373434 x6 :
> > ffffffd8167397c8
> > [  129.447610][  T934] x5 : 0000000000000000 x4 : 0000000000000001 x3 :
> > ffffffc0843db9e0
> > [  129.447629][  T934] x2 : 0000000000002f15 x1 : 0000000000000000 x0 :
> > 0000000000000000
> > [  129.447647][  T934] Call trace:
> > [  129.447655][  T934]  android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug
> > 1400000003000000474e550080cce8a8a78606b6]
> > [  129.447681][  T934]  __might_resched+0x190/0x1a8
> > [  129.447694][  T934]  shrink_work+0x180/0x248
> > [  129.447706][  T934]  process_one_work+0x260/0x624
> > [  129.447718][  T934]  worker_thread+0x28c/0x454
> > [  129.447729][  T934]  kthread+0x118/0x158
> > [  129.447742][  T934]  ret_from_fork+0x10/0x20
> > [  129.447761][  T934] Code: ???????? ???????? ???????? d2b5dd1f (d4210000)
> > [  129.447772][  T934] ---[ end trace 0000000000000000 ]---
> > 
> > dm_bufio_lock will call spin_lock_bh when try_verify_in_tasklet
> > is enabled, and __scan will be called in atomic context.
> > 
> > Signed-off-by: LongPing Wei <weilongp...@oppo.com>
> > Cc: sta...@vger.kernel.org
> > 
> > ---
> >   drivers/md/dm-bufio.c |    7 ++++++-
> >   1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > Index: linux-2.6/drivers/md/dm-bufio.c
> > ===================================================================
> > --- linux-2.6.orig/drivers/md/dm-bufio.c    2025-04-14 16:01:58.000000000
> > +0200
> > +++ linux-2.6/drivers/md/dm-bufio.c 2025-04-16 14:18:44.000000000 +0200
> > @@ -2424,7 +2424,12 @@ static void __scan(struct dm_bufio_clien
> >                             atomic_long_dec(&c->need_shrink);
> >                     freed++;
> > -                   cond_resched();
> > +
> > +                   if (!(freed & 15)) {
> > +                           dm_bufio_unlock(c);
> > +                           cond_resched();
> > +                           dm_bufio_lock(c);
> > +                   }
> >             }
> >     }
> >   }
> > 
> 
> Hi, Mikulas
> 
> I have tested my v2 patch with try_verify_in_tasklet enabled on my
> device and the BUG is fixed.
> 
> Should we change the current behaviour for devices wihtout
> try_verify_in_tasklet enabled?
> 
> Thanks
> 
> LongPing

Yes, I think that dropping the lock is usefull even for configurations 
without try_verify_in_tasklet, so that dm-bufio doesn't have too big 
latency. I accepted the v3 patch that you sent.

Mikulas


Reply via email to