On Sun, Nov 02, 2014 at 02:57:22PM -0700, Chris Murphy wrote:
> On Nov 1, 2014, at 10:49 PM, Robert White <rwh...@pobox.com> wrote:
> 
> > On 10/31/2014 10:34 AM, Tobias Holst wrote:
> >> I am now using another system with kernel 3.17.2 and btrfs-tools 3.17
> >> and inserted one of the two HDDs of my btrfs-RAID1 to it. I can't add
> >> the second one as there are only two slots in that server.
> >>
> >> This is what I got:
> >>
> >> tobby@ubuntu: sudo btrfs check /dev/sdb1
> >> warning, device 2 is missing
> >> warning devid 2 not found already
> >> root item for root 1746, current bytenr 80450240512, current gen
> >> 163697, current level 2, new bytenr 40074067968, new gen 163707, new
> >> level 2
> >> Found 1 roots with an outdated root item.
> >> Please run a filesystem check with the option --repair to fix them.
> >>
> >> tobby@ubuntu: sudo btrfs check --repair /dev/sdb1
> >> enabling repair mode
> >> warning, device 2 is missing
> >> warning devid 2 not found already
> >> Unable to find block group for 0
> >> extent-tree.c:289: find_search_start: Assertion `1` failed.
> > 
> > The read-only snapshots taken under 3.17.1 are your core problem.
> > 
> > Now btrfsck is refusing to operate on the degraded RAID because
> > degraded RAID is degraded so it's read-only. (this is an educated
> > guess).
> 
> Degradedness and writability are orthogonal. If there's some problem
> with the fs that prevents it from being mountable rw, then that'd
> apply for both normal and degraded operation. If the fs is OK, it
> should permit writable degraded mounts.
> 
> > Since btrfsck is _not_ a mount type of operation its got no "degraded
> > mode" that would let you deal with half a RAID as far as I know.
> 
> That's a problem. I can see why a repair might need an additional flag
> (maybe force) to repair a volume that has the minimum number of devices
> for degraded mounting, but not all are present.
> Maybe we wouldn't want
> it to be easy to accidentally run a repair that changes the file system
> when a device happens to be missing inadvertently that could be found
> and connected later.
> 
> I think related to this is a btrfs equivalent of a bitmap. The metadata
> already has this information in it, but possibly right now btrfs
> lacks the equivalent behavior of mdadm readd when a previously missing
> device is reconnected. If it has a bitmap then it doesn't have to be
> completely rebuilt; the bitmap contains information telling md how to
> "catch up" the readded device, i.e. only that which is different needs
> to be written upon a readd.
> 
> For example if I have a two device Btrfs raid1 for both data and
> metadata, and one device is removed and I mount -o degraded,rw one
> of them and make some small changes, unmount, then reconnect the
> missing device and mount NOT degraded - what happens? I haven't tried
> this.
I have. It's a filesystem-destroying disaster. Never do it, and never let
it happen accidentally. If a disk gets temporarily disconnected, make sure
you either never mount the filesystem degraded, or never let the disk come
back (i.e. take the disk to another machine and wipefs it). Don't ever,
ever put 'degraded' in /etc/fstab mount options. Nope. No.

btrfs seems to assume the data on both disks is correct (the generation
numbers and checksums are OK), but it gets confused by equally plausible
but different metadata on each disk. It doesn't take long before the
filesystem becomes data soup or crashes the kernel.

There is more than one way to get to this point. Take LVM snapshots of
the devices in a btrfs RAID1 array, and 'btrfs device scan' will see two
different versions of each btrfs device in the filesystem (one for the
origin LV and one for the snapshot). btrfs then assembles LVs of different
vintages at random (e.g. one from the mount command line, one from an
earlier LVM snapshot of the second disk), with disastrous results similar
to the above.

IMHO, if btrfs sees multiple devices with the same UUIDs, it should reject
all of them and require an explicit device list; however, mdadm has a way
to deal with this that would also work. mdadm puts event counters and
timestamps in the device superblocks to prevent exactly this kind of
accidental disjoint assembly and modification of array members. If disks
go temporarily offline and receive separate modifications, mdadm refuses
to accept disks with different counter+timestamp data (so you'll get all
the disks but one rejected, or only one disk with all others rejected).
The rejected disk(s) have to go through full device recovery before
rejoining the array: someone has to use mdadm to add each rejected disk
back as if it were a new, blank one.

Currently btrfs won't mount a degraded array by default, which prevents
unrecoverable inconsistency.
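The event-counter rule in the mdadm analogy above can be sketched in a few
lines of shell. This is purely illustrative -- btrfs has no such assembly
logic today, and the function name, device names, and counter values are
all made up for the example:

```shell
#!/bin/sh
# Sketch of mdadm-style assembly: given "name:events" pairs read from
# each member's superblock, keep only the devices that agree on the
# highest event count. Everything else is rejected and would have to
# be re-added as if it were a new, blank disk.
pick_members() {
    highest=0
    for pair in "$@"; do
        ev=${pair#*:}
        if [ "$ev" -gt "$highest" ]; then highest=$ev; fi
    done
    for pair in "$@"; do
        # Print the device name only if its counter matches the winner.
        if [ "${pair#*:}" -eq "$highest" ]; then
            printf '%s\n' "${pair%%:*}"
        fi
    done
}

# sdb fell behind while sda was mounted degraded,rw:
pick_members sda:163707 sdb:163697    # prints only "sda"
```

The point of the rule is that a stale member can never be silently merged
back in: it is either identical to the survivors or it goes through a full
rebuild, so disjoint modifications can't be combined.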
That's a safe behavior for now, but sooner or later btrfs will need to be
able to safely boot unattended on a degraded RAID1 root filesystem.

> And I also don't know if a full balance (hours) is needed to
> "catch up" the formerly missing device. With md this is very fast -
> seconds/minutes depending on how much has been changed.

I schedule a scrub immediately after boot, on the assumption that it will
resolve any data differences (and also on the assumption that the reboot
was caused by a disk-related glitch, which it usually is for me). That
might not be enough for metadata differences, and it's certainly not
enough for modifications made in degraded mode. A full balance is out of
my reach: it takes weeks on even my medium-sized filesystems, and
mkfs + rsync from backup is much faster.
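The scrub-after-boot step can be automated with a small systemd unit.
This is only a sketch; the unit name and the /data mount point are
placeholders for whatever your filesystem actually uses:

```ini
# /etc/systemd/system/btrfs-scrub-boot.service (hypothetical unit name)
[Unit]
Description=One-shot btrfs scrub after boot
# Don't start until the filesystem is actually mounted.
RequiresMountsFor=/data

[Service]
Type=oneshot
# -B keeps the scrub in the foreground, so the unit's exit status
# reflects whether the scrub found uncorrectable errors.
ExecStart=/usr/bin/btrfs scrub start -B /data

[Install]
WantedBy=multi-user.target
```

Enable it with 'systemctl enable btrfs-scrub-boot.service'; as noted
above, this only repairs data/checksum divergence, not the metadata
split-brain caused by degraded,rw mounts.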