On Wed, Jan 22, 2025 at 07:51:04AM +0100, Gerhard Wiesinger wrote:
> On 05.09.2024 06:29, Kent Overstreet wrote:
> > On Sun, Sep 01, 2024 at 02:01:25PM GMT, Gerhard Wiesinger wrote:
> > > Hello,
> > > 
> > > I'm running several Fedora Linux VMs (current versions, latest updates)
> > > in a virtual test infrastructure on VirtualBox. The VMs use different
> > > filesystems (ext4, xfs, zfs, bcachefs and btrfs).
> > > 
> > > I had a problem with the underlying hardware where around 1000 4k
> > > blocks could no longer be read. I migrated the whole disk with
> > > ddrescue, which worked well.
> > > 
> > > Of course I expected some data loss in the VMs, but I wanted to get
> > > them back into a consistent state.
> > > 
> > > The following filesystems were easily brought back into a consistent
> > > state with their corresponding repair/scrub tools:
> > > - ext4
> > > - xfs
> > > - zfs
> > > 
> > > Unfortunately, two filesystems could not be brought into a state where
> > > the repair tools report "everything fine" (of course with some data
> > > loss, but that's acceptable):
> > > - btrfs
> > > - bcachefs
> > > 
> > > commands run with bcachefs (git version):
> > > git log -n1 | head -n1
> > > commit 1e058db4b603f8992b781b4654b48221dd04407a
> > > ./bcachefs version
> > > 1.12.0
> > > 
> > > But bcachefs never reached a consistent state, not even with newer
> > > versions. I also checked with an older version (1.7.0), which ran for
> > > a long time.
> > > 
> > > To reproduce the problem I created a new filesystem and copied some
> > > files to it:
> > > mkfs.bcachefs -f /dev/sdb
> > > mount /dev/sdb /mnt
> > > time cp -Rap /usr /mnt
> > > 
> > > Afterwards I created a (quick&dirty) script "corrupt_device.sh" to
> > > corrupt the device in the same manner as the original failure (1000
> > > randomly chosen 4k blocks are overwritten).
> > > Script: see below
> > > 
> > > ~/corrupt_device.sh
> > > ./bcachefs fsck -pf /dev/sdb
> > > ./bcachefs fsck -pfR /dev/sdb
> > > 
> > > Result: It is reproducible that bcachefs cannot be brought into a
> > > consistent state even after several runs of the repair.
> > > 
> > > You can also try to reproduce it and create a testcase out of it.
> > > 
> > > Any ideas on how to repair this, and what can be done to get it into a
> > > consistent state?
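
Side note for anyone wanting to reproduce this: the corruption step can
be approximated with plain coreutils. The following is only a sketch,
not the original corrupt_device.sh, and it assumes /dev/sdb is a
disposable test device:

#!/bin/bash
# Sketch of corrupt_device.sh: overwrite 1000 randomly chosen 4k blocks
# with random data, roughly matching the original failure mode.
dev=${1:-/dev/sdb}
nblocks=$(( $(blockdev --getsize64 "$dev") / 4096 ))

# Pick 1000 distinct block numbers and clobber each with random data.
for blk in $(shuf -i 0-$(( nblocks - 1 )) -n 1000); do
        dd if=/dev/urandom of="$dev" bs=4096 seek="$blk" count=1 \
                conv=notrunc status=none
done

Purely random placement can also hit the superblock area, just as the
real hardware failure could.
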
> > If you've got a filesystem you want data off of - send me a metadata
> > dump (join the IRC channel, send it via magic wormhole) and I'll debug.
> > 
> > We still haven't comprehensively torture tested all the repair paths
> > (which is probably the biggest reason it's still marked as
> > experimental); all the repair paths are there, but there's still bugs to
> > shake out.
> > 
> > Thanks for the test - I'll try to make use of it when I'm working in
> > that area again.
> 
> Did you find the time to use the test and fix the issues?
> 
> As bcachefs gets more stable, I think we should focus on such
> "destroy & repair" test cases, so the filesystem can reliably be brought
> back into a consistent state and users can build trust in it.

Not yet, but thanks for the bump.

I did make my own "kill_btree_(node|root)" tests more comprehensive, so
some stuff has likely been fixed.

Would you want to retest? And if you could turn this into a ktest test,
that would make it really easy for me to check whether there are still
issues...
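
Roughly, such a test just needs to automate the cycle you described and
then require that a second fsck pass comes back clean. A plain-shell
sketch, with the device and mountpoint as placeholders - it would still
have to be adapted to the ktest harness:

#!/bin/bash
# Sketch of an automated destroy-and-repair test.
set -u
dev=/dev/sdb        # disposable test device (placeholder)
mnt=/mnt

mkfs.bcachefs -f "$dev"
mount "$dev" "$mnt"
cp -Rap /usr "$mnt"
umount "$mnt"

./corrupt_device.sh "$dev"      # overwrite 1000 random 4k blocks

# The first pass is expected to find and repair damage, so its exit
# code is ignored.
bcachefs fsck -pf "$dev" || true

# The second pass must find nothing left to fix; a non-zero exit here
# means repair did not converge and the test should fail.
bcachefs fsck -pf "$dev"

The key assertion is the second fsck: once repair has converged, it
should have nothing left to report.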
