On Fri, Oct 13, 2023 at 07:52:54PM +0800, Daniel J Blueman wrote:
> On Fri, 13 Oct 2023 at 19:06, Brian Foster <[email protected]> wrote:
> > On Tue, Oct 10, 2023 at 11:20:59PM +0800, Coly Li wrote:
> > > Forwarding this to [email protected].
> > >
> > > > On Oct 10, 2023, at 07:43, Daniel J Blueman <[email protected]> wrote:
> > > >
> > > > Firstly, bcachefs introduces a new era of in-tree filesystems with
> > > > some monumental features (sorry, ZFS); hats off to Kent for landing
> > > > this!
> > > >
> > > > My testing finds it is in great shape; far better than BTRFS was when
> > > > it landed. Testing on linux next-20231005 with additional debug checks
> > > > atop the Ubuntu 23.04 kernel generic config [1], I was able to provoke
> > > > a btree trans path overflow corner case [2].
> >
> > Thanks for testing and for the bug report. I'm curious what kind of
> > testing you've been doing in general and perhaps if you have any more
> > interesting results to share? ;)
> 
> Thanks for asking Brian! My validation seeks to exercise edge cases by
> triggering multiple conditions in parallel, then working backwards to
> identify a minimal and reliable reproducer.
> 
> I'll prepare a post with additional early findings in the coming days.
> 
> > > > The minimal reproducer is:
> > > > # modprobe brd rd_nr=2 rd_size=1048576
> > > > # bcachefs format --metadata_replicas=2 --label=tier1.1 /dev/ram0 \
> > > >     --label=tier1.2 /dev/ram1
> > > > # mount -t bcachefs /dev/ram0:/dev/ram1 /mnt
> > > > # dd if=/dev/zero of=/mnt/test bs=128M
> > > >
> > > > The issue doesn't reproduce with metadata_replicas=1 or a single block 
> > > > device.
> >
> > My naive assumption would be that the higher replica count increases the
> > size of some of these transactions due to having to update multiple
> > devices, but this is not an area I've dug into in depth tbh. I've not
> > seen this yet myself, but I think most of the multi device testing I've
> > done so far has still been with replicas=1.
> 
> With the stock Ubuntu kernel config (no CONFIG_LOCKDEP, likely similar
> to your config), my testing eventually provoked a crash; I'll see if
> I can reproduce this later in case it's the same path as this report
> with CONFIG_LOCKDEP.
> 

Ah, OK. It would be interesting to know whether LOCKDEP contributes here
through overhead or timing effects, as opposed to something functional.

> > > > At debug entry, I couldn't determine why BTREE_ITER_MAX must be 32
> > > > rather than 64 when CONFIG_LOCKDEP is set; however, the panic doesn't
> > > > occur without CONFIG_LOCKDEP, so it appears related. Keeping it at 32
> > > > with CONFIG_LOCKDEP doesn't prevent the panic either.
> > > >
> > > > @Kent/anyone?
> >
> > Hmm, that is interesting. I'm not sure why lockdep would be a factor
> > here. Where did you come up with the notion that ITER_MAX could be 32
> > (and the idea to test it) with CONFIG_LOCKDEP=y? It looks like it's
> > hardcoded to 64, but perhaps I'm missing something.
> 
> See 
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/fs/bcachefs/btree_types.h#n390
> 
> #ifndef CONFIG_LOCKDEP
> #define BTREE_ITER_MAX 64
> #else
> #define BTREE_ITER_MAX 32
> #endif
> 

Oh, I was looking at the master branch. I didn't realize this was
different in for-next. That looks like a relatively old commit as well.
Kent might have to chime in on what the deal is with that.

So my guess would be that this is a bit of a whack-a-mole between
accommodating lockdep and workloads that stress the transaction data
structure in this way. For the purposes of your testing, I think it's
probably more useful to either set this to 64 (and accept the more
graceful lockdep overflow situation) or just disable lockdep and confirm
whether things work correctly.
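
Concretely, those two options might look something like this (a sketch only,
assuming a linux-next checkout in the current directory; the BTREE_ITER_MAX
change is a local patch rather than a tunable, and since CONFIG_LOCKDEP is
normally selected by options like PROVE_LOCKING, those get disabled too):

```shell
# Option 1: keep lockdep, but lift BTREE_ITER_MAX back to 64 locally
sed -i 's/\(BTREE_ITER_MAX[[:space:]]*\)32/\164/' fs/bcachefs/btree_types.h

# Option 2: rebuild with lockdep off entirely
scripts/config --disable PROVE_LOCKING --disable DEBUG_LOCK_ALLOC \
               --disable LOCK_STAT --disable LOCKDEP
make olddefconfig
make -j"$(nproc)"
```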

For the purposes of bcachefs, I think this means one of two things: either
this setting is not suitable for !BCACHEFS_DEBUG kernels (regardless of
lockdep state), or the handling of this limit condition needs to be made
more graceful one way or another. We probably don't want to trade the
lockdep overflow behavior (which IIRC just splats and disables lockdep)
for a kernel panic in most cases.

Brian

> > We'll see if Kent can make more sense of the crash signature whenever
> > he's back around. Otherwise I'll make a note to try and reproduce it
> > once I work through some other things...
> 
> It would be good to get some independent confirmation on this.
> 
> Thanks again!
>   Dan
> -- 
> Daniel J Blueman
> 
