On 2014-10-08 15:11, Eric Sandeen wrote:
I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
        "Errors are corrected along if possible" (what *is* possible?)
* mount -o recovery
        "Enable autorecovery attempts if a bad tree root is found at mount 
time."
* mount -o degraded
        "Allow mounts to continue with missing devices."
        (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
        "remove the log tree if log tree is corrupt"
* btrfs rescue
        "Recover a damaged btrfs filesystem"
        chunk-recover
        super-recover
        How does this relate to btrfs check?
* btrfs check
        "repair a btrfs filesystem"
        --repair
        --init-csum-tree
        --init-extent-tree
        How does this relate to btrfs rescue?
* btrfs restore
        "try to salvage files from a damaged filesystem"
        (not really repair, it's disk-scraping)


What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
same errors, only online vs. offline?  If not, what class of errors does one
fix vs. the other?  How would an admin know?  Can btrfs check recover a bad tree root
in the same way that mount -o recovery does?  How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?

Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a RAID volume; that is, it fixes disparity between multiple copies of the same block. IOW, it isn't really repair per se, but more preventative maintnence. Currently, it only works for cases where you have multiple copies of a block (dup, raid1, and raid10 profiles), but support is planned for error correction of raid5 and raid6 profiles. * mount -o recovery I don't know much about, but AFAICT, it s more for dealing with metadata related FS corruption. * mount -o degraded is used to mount a fs configured for a raid storage profile with fewer devices than the profile minimum. It's primarily so that you can get the fs into a state where you can run 'btrfs device replace' * btrfs-zero-log only deals with log tree corruption. This would be roughly equivalent to zeroing out the journal on an XFS or ext4 filesystem, and should almost never be needed. * btrfs rescue is intended for low level recovery corruption on an offline fs. * chunk-recover I'm not entirely sure about, but I believe it's like scrub for a single chunk on an offline fs * super-recover is for dealing with corrupted superblocks, and tries to replace it with one of the other copies (which hopefully isn't corrupted) * btrfs check is intended to (eventually) be equivalent to the fsck utility for most other filesystems. Currently, it's relatively good at identifying corruption, but less so at actually fixing it. There are however, some things that it won't catch, like a superblock pointing to a corrupted root tree. * btrfs restore is essentially disk scraping, but with built-in knowledge of the filesystem's on-disk structure, which makes it more reliable than more generic tools like scalpel for files that are too big to fit in the metadata blocks, and it is pretty much essential for dealing with transparently compressed files.

In general, my personal procedure for handling a misbehaving BTRFS filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify what's wrong
* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue super-recover
* If all of the above fails, ask for advice on the mailing list or IRC
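The steps above can be sketched as a command sequence (placeholder device
/dev/sdb and mount point /mnt; each later step is only tried if an earlier one
failed or specifically pointed at that problem):

```shell
# 1. Identify the damage first; without --repair this is read-only:
btrfs check /dev/sdb

# 2. Try increasingly conservative mounts:
mount -o recovery /dev/sdb /mnt
mount -o ro,recovery /dev/sdb /mnt

# 3. Only for a raid set that has lost a disk:
mount -o degraded /dev/sdb /mnt

# 4. Targeted fixes, only if check output / dmesg pointed at them:
btrfs-zero-log /dev/sdb               # corrupt log tree
btrfs rescue super-recover /dev/sdb   # corrupt superblock

# 5. Otherwise, stop here and ask on the mailing list or IRC
#    before reaching for btrfs check --repair.
```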
Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.
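A periodic scrub can be set up with an ordinary cron entry (a sketch; the
file name, monthly cadence, and mount point are all assumptions to adapt):

```shell
# /etc/cron.d/btrfs-scrub (hypothetical file):
# scrub /mnt at 03:00 on the first of each month
0 3 1 * * root /sbin/btrfs scrub start -Bd /mnt >/dev/null 2>&1
```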
