On Fri, Dec 11, 2020 at 03:25:47PM +0100, Ulrich Windl wrote:
> Hi!
>
> While configuring a VM environment in a cluster I had setup an SLES15 SP2
> test VM using BtrFS. Due to some problem with libvirt (or the VirtualDomain
> RA) the VM was active on more than one cluster node at a time, corrupting the
> filesystem beyond repair it seems:
> hvc0:rescue:~ # btrfs check /dev/xvda2
> Opening filesystem to check...
> Checking filesystem on /dev/xvda2
> UUID: 1b651baa-327b-45fe-9512-e7147b24eb49
> [1/7] checking root items
> ERROR: child eb corrupted: parent bytenr=1107230720 item=75 parent level=1 child level=1
> ERROR: failed to repair root items: Input/output error
> hvc0:rescue:~ # btrfsck -b /dev/xvda2
> Opening filesystem to check...
> Checking filesystem on /dev/xvda2
> UUID: 1b651baa-327b-45fe-9512-e7147b24eb49
> [1/7] checking root items
> ERROR: child eb corrupted: parent bytenr=1106952192 item=75 parent level=1 child level=1
> ERROR: failed to repair root items: Input/output error
> hvc0:rescue:~ # btrfsck --repair /dev/xvda2
> enabling repair mode
> Opening filesystem to check...
> Checking filesystem on /dev/xvda2
> UUID: 1b651baa-327b-45fe-9512-e7147b24eb49
> [1/7] checking root items
> ERROR: child eb corrupted: parent bytenr=1107230720 item=75 parent level=1 child level=1
> ERROR: failed to repair root items: Input/output error
>
> Two questions arising:
> 1) Can't the kernel set some "open flag" early when opening the
> filesystem, and refuse to open it again (the other VM) when the flag
> is set? That could avoid such situations I guess
If btrfs wrote "the filesystem is open" to the disk, the filesystem would
not be mountable after a crash. The kernel does set an "open flag" (it
detects that it is about to mount the same btrfs by uuid, and does
something like a bind mount instead), but that applies only to multiple
btrfs mounts on the _same_ kernel. In your case there are multiple kernels
present (one in each node) and there's no way for them to communicate with
each other.

There are at least 3 different ways libvirt or other hosting
infrastructure software on the VM host could have avoided passing the same
physical device to multiple VM guests. I would suggest implementing some
or all of them. (A small demonstration of the single-kernel case, and a
sketch of one host-side locking mechanism, are appended further down in
this reply.)

> 2) Can't btrfs check try somewhat harder to rescue anything, or is
> the fs structure in a way that everything is lost?
> What really puzzles me is this:
> There are several snapshots and subvolumes on the BtrFS device. It's
> hard to believe that absolutely nothing seems to be recoverable.

The most likely outcome is that the root tree nodes and most of the
interior nodes of all the filesystem trees are broken. The kernel relies
on the trees to work--everything in btrfs except the superblocks can be at
any location on disk--so the filesystem will be unreadable by the kernel.
Only recovery tools would be able to read the filesystem now.

Recovery requires a brute force search of the disk to find as many
surviving leaf nodes as possible and rebuild the filesystem trees. This is
more or less what 'btrfs check --repair --init-extent-tree' does. If you
run --init-extent-tree, assuming it works (you should not assume that it
will work), you would then have to audit the filesystem contents to see
what data was not recovered. At a minimum, you would lose a few hundred
filesystem items, since each metadata leaf node contains around 200 items
and you definitely will not recover them all. The data csum trees might
not be in sync with the rest of the filesystem, so you can't rely on scrub
to check data integrity.

If this is successful, you will have a similar result to mounting ext4 on
multiple VMs simultaneously--fsck runs, the filesystem is read-write
again, but you don't get all the data back, nor even a list of data that
was lost or corrupted.

--init-extent-tree can be quite slow, especially if you don't have enough
RAM to hold all the filesystem's metadata. It's still under development,
so one possible outcome is that it crashes with an assertion failure and
leaves you with an even more broken filesystem. It's usually faster and
easier to mkfs and restore from backups instead.
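
If you do want to attempt recovery anyway, here is a rough sketch of the
sequence described above (the device path is taken from your output,
/mnt/recovery is only an example, and none of these steps is guaranteed
to work on metadata this badly damaged):

  # Work on a copy of the disk image if at all possible.

  # First try to copy out whatever is still readable; btrfs restore
  # does not write to the source device:
  btrfs restore -v -i /dev/xvda2 /mnt/recovery/

  # If that finds nothing useful, look for older tree roots and point
  # restore at one of them explicitly:
  btrfs-find-root /dev/xvda2
  btrfs restore -t <bytenr> -v -i /dev/xvda2 /mnt/recovery/

  # Only after that, and only on a copy, the destructive rebuild
  # discussed above:
  btrfs check --repair --init-extent-tree /dev/xvda2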
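
To illustrate the single-kernel case from your first question: within one
host, a second mount of an already-mounted btrfs is not a second
independent "open" of the device (the device name /dev/vdb and the
mountpoints are only an example):

  # both mounts attach the same filesystem instance; the kernel
  # notices by uuid that the filesystem is already mounted
  mkdir -p /mnt/a /mnt/b
  mount /dev/vdb /mnt/a
  mount /dev/vdb /mnt/b
  findmnt -t btrfs   # /mnt/a and /mnt/b show the same source and options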
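
As for the host-side protections: one example of that class of mechanism
is libvirt's lock manager (virtlockd), which takes a lease on each disk
when a domain starts and refuses to start the domain again while the
lease is held elsewhere, provided the lockspace directory is on storage
shared by all cluster nodes. A rough sketch with the upstream defaults; I
have not tested this on SLES15 SP2, so treat it as a pointer rather than
a recipe:

  # /etc/libvirt/qemu.conf (on every node)
  lock_manager = "lockd"

  # /etc/libvirt/qemu-lockd.conf (on every node)
  # The lockspace must be on shared storage visible to all nodes,
  # otherwise the leases only protect against double-starts on the
  # same host.
  file_lockspace_dir = "/var/lib/libvirt/lockd/files"

  # then, on every node:
  systemctl enable --now virtlockd
  systemctl restart libvirtd

libvirt also ships a sanlock-based lock manager as an alternative.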
> I have this:
> hvc0:rescue:~ # btrfs inspect-internal dump-super /dev/xvda2
> superblock: bytenr=65536, device=/dev/xvda2
> ---------------------------------------------------------
> csum_type                 0 (crc32c)
> csum_size                 4
> csum                      0x659898f3 [match]
> bytenr                    65536
> flags                     0x1
>                           ( WRITTEN )
> magic                     _BHRfS_M [match]
> fsid                      1b651baa-327b-45fe-9512-e7147b24eb49
> metadata_uuid             1b651baa-327b-45fe-9512-e7147b24eb49
> label
> generation                280
> root                      1107214336
> sys_array_size            97
> chunk_root_generation     35
> root_level                0
> chunk_root                1048576
> chunk_root_level          0
> log_root                  0
> log_root_transid          0
> log_root_level            0
> total_bytes               10727960576
> bytes_used                1461825536
> sectorsize                4096
> nodesize                  16384
> leafsize (deprecated)     16384
> stripesize                4096
> root_dir                  6
> num_devices               1
> compat_flags              0x0
> compat_ro_flags           0x0
> incompat_flags            0x163
>                           ( MIXED_BACKREF |
>                             DEFAULT_SUBVOL |
>                             BIG_METADATA |
>                             EXTENDED_IREF |
>                             SKINNY_METADATA )
> cache_generation          280
> uuid_tree_generation      40
> dev_item.uuid             2abdf93e-2f2d-4eef-a1d8-9325f809ebce
> dev_item.fsid             1b651baa-327b-45fe-9512-e7147b24eb49 [match]
> dev_item.type             0
> dev_item.total_bytes      10727960576
> dev_item.bytes_used       2436890624
> dev_item.io_align         4096
> dev_item.io_width         4096
> dev_item.sector_size      4096
> dev_item.devid            1
> dev_item.dev_group        0
> dev_item.seek_speed       0
> dev_item.bandwidth        0
> dev_item.generation       0
>
> Regards,
> Ulrich Windl
>
>