On Wed, Jan 22, 2025 at 07:51:04AM +0100, Gerhard Wiesinger wrote:
> On 05.09.2024 06:29, Kent Overstreet wrote:
> > On Sun, Sep 01, 2024 at 02:01:25PM GMT, Gerhard Wiesinger wrote:
> > > Hello,
> > >
> > > I'm running some Fedora Linux VMs (current versions, latest updates) in a
> > > virtual test infrastructure on VirtualBox. There I run different VMs with
> > > different filesystems (ext4, xfs, zfs, bcachefs and btrfs).
> > >
> > > I had a hardware problem on the underlying hardware where around 1000 4k
> > > blocks could no longer be read. I migrated the whole disk with ddrescue,
> > > which worked well.
> > >
> > > Of course I was expecting some data loss in the VMs, but I wanted to get
> > > them into a consistent state.
> > >
> > > The following filesystems were easily brought into a consistent state
> > > with their respective repair/scrub tools:
> > > - ext4
> > > - xfs
> > > - zfs
> > >
> > > Unfortunately, two filesystems can't be brought into a state where the
> > > repair tools report "everything fine" (some data loss is expected and
> > > acceptable):
> > > - btrfs
> > > - bcachefs
> > >
> > > Commands run with bcachefs (git version):
> > > git log -n1 | head -n1
> > > commit 1e058db4b603f8992b781b4654b48221dd04407a
> > > ./bcachefs version
> > > 1.12.0
> > >
> > > But bcachefs never reached a consistent state, even with newer versions.
> > > I also checked with older versions (1.7.0), which ran for a long time.
> > >
> > > To reproduce the problem I created a new filesystem and copied some
> > > files there:
> > > mkfs.bcachefs -f /dev/sdb
> > > time cp -Rap /usr /mnt
> > >
> > > Afterwards I wrote a (quick&dirty) script "corrupt_device.sh" to corrupt
> > > the device in the same manner as the original failure (1000 4k blocks
> > > randomly overwritten).
> > > Script: see below
> > >
> > > ~/corrupt_device.sh
> > > ./bcachefs fsck -pf /dev/sdb
> > > ./bcachefs fsck -pfR /dev/sdb
> > >
> > > Result: it is reproducible that bcachefs can't be brought into a
> > > consistent state, even after several runs of the repair.
> > >
> > > You can also try to reproduce it and create a testcase out of it.
> > >
> > > Any ideas how to repair it and what can be done to get it into a
> > > consistent state?
> >
> > If you've got a filesystem you want data off of - send me a metadata
> > dump (join the IRC channel, send it via magic wormhole) and I'll debug.
> >
> > We still haven't comprehensively torture tested all the repair paths
> > (which is probably the biggest reason it's still marked as
> > experimental); all the repair paths are there, but there's still bugs to
> > shake out.
> >
> > Thanks for the test - I'll try to make use of it when I'm working in
> > that area again.
>
> Did you find the time to use the test and fix the issues?
>
> As bcachefs gets more stable, I think we should focus on such
> "destroy & repair" test cases, to get filesystems consistent again and
> build trust.
Not yet, but thanks for the bump. I did make my own "kill_btree_(node|root)" tests more comprehensive, so some of this has likely been fixed. Would you want to retest? And if you could turn this into a ktest test, that would make it really easy for me to check whether there are still issues...
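For reference, here's roughly the shape I'd expect such a reproducer to take - a rough sketch only, not your actual corrupt_device.sh (which wasn't included here), and it assumes /dev/sdb is a throwaway scratch device that can be destroyed:

#!/usr/bin/env bash
# Sketch of a "corrupt and repair" reproducer; assumptions: /dev/sdb is a
# scratch device, /mnt is free, and 1000 random 4k blocks are overwritten
# as in the original report.
set -euo pipefail

DEV=/dev/sdb
NBLOCKS=1000      # number of 4k blocks to corrupt
BLKSZ=4096

# Create a filesystem and populate it with some data
mkfs.bcachefs -f "$DEV"
mount -t bcachefs "$DEV" /mnt
cp -Rap /usr /mnt || true
umount /mnt

# Overwrite $NBLOCKS randomly chosen 4k blocks with random data
DEV_BLOCKS=$(( $(blockdev --getsize64 "$DEV") / BLKSZ ))
for blk in $(shuf -i 0-$((DEV_BLOCKS - 1)) -n "$NBLOCKS"); do
    dd if=/dev/urandom of="$DEV" bs="$BLKSZ" seek="$blk" count=1 \
       conv=notrunc status=none
done

# Run repair repeatedly; succeed as soon as fsck reports a clean filesystem
for i in $(seq 1 5); do
    if bcachefs fsck -pf "$DEV"; then
        echo "fsck clean after $i pass(es)"
        exit 0
    fi
done
echo "still inconsistent after 5 fsck passes"
exit 1

Something along those lines, dropped into the ktest framework, is what would let me run it regularly.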
