Re: Re: What is the vision for btrfs fs repair?
On 10/11/2014 3:29 AM, Goffredo Baroncelli wrote:

On 10/10/2014 12:53 PM, Bob Marley wrote:

If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

I cannot agree. I consider it a sane default to have a consistent state with the recently written data lost, instead of requiring user intervention to avoid losing anything. To address your requirement, we would need a "super sync" command which ensures that the data are in the filesystem and not only in the log (as sync should ensure).

I have to agree. There is a reason we have fsck -p and why that is what is run at boot time. Some repairs involve a tradeoff that will result in permanent data loss that maybe could have been avoided by going the other way, or by performing manual recovery. Such repairs should never be done automatically by default. For that matter, I'm not even sure this sort of thing should be there as a mount option at all. It really should require a manual fsck run with a big warning that *THIS WILL THROW OUT SOME DATA*. Now if the data is saved to a snapshot or something, so you can manually try to recover it later rather than having it thrown out wholesale, I can see that being done automatically at boot time. Of course, if btrfs is that damaged, then wouldn't grub be unable to load your kernel in the first place?
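For what it's worth, btrfs-progs already ships something close to that "super sync": a command that asks the kernel to commit the current btrfs transaction rather than merely getting data into the log. A minimal sketch, assuming the filesystem is mounted at /mnt (the path is only an example, and whether this is strictly stronger than a plain sync is my reading of the tool, not a documented promise):

    # Flush dirty data system-wide first.
    sync
    # Ask btrfs to commit the current transaction on /mnt
    # (wraps the BTRFS_IOC_SYNC ioctl).
    btrfs filesystem sync /mnt

A database that really needed this guarantee per transaction would still have to issue the ioctl itself after fsync, rather than relying on an administrator running the command by hand.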
Re: What is the vision for btrfs fs repair?
On 2014-10-10 18:05, Eric Sandeen wrote:

On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions.

If they need reliability, they should have some form of redundancy in place and/or run the database directly on the block device; because even ext4, XFS, and pretty much every other filesystem can lose data sometimes,

Not if, i.e., fsync returns. If the data is gone later, it's a hardware problem, or occasionally a bug - bugs that are usually found and fixed pretty quickly.

Yes, barring bugs and hardware problems they won't lose data.

the difference being that those tend to give worse results when hardware is misbehaving than BTRFS does, because BTRFS usually has an old copy of whatever data structure gets corrupted to fall back on.

I'm curious, is that based on conjecture or real-world testing?

I wouldn't really call it testing, but based on personal experience I know that ext4 can lose whole directory sub-trees if it gets a single corrupt sector in the wrong place. I've also had that happen on FAT32 and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and I've actually had individual files disappear on HFS+ without any discernible hardware issues. I don't have as much experience with XFS, but would assume, based on what I do know of it, that it could have similar issues. As for BTRFS, I've only ever had issues with it three times: one was due to the kernel panicking during resume from S1, and the other two were due to hardware problems that would have caused issues on most other filesystems as well. In both cases of hardware issues, while the filesystem was initially unmountable, it was relatively simple to fix once I knew how. I tried to fix an ext4 fs that had become unmountable due to dropped writes once, and that was anything but simple, even with the much greater amount of documentation.
Re: What is the vision for btrfs fs repair?
On 2014-10-12 06:14, Martin Steigerwald wrote:

On Friday, 10 October 2014, 10:37:44 Chris Murphy wrote:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is better than a file system that won't mount without user intervention - one which requires a secret decoder ring to get it to mount at all. And it may require the use of specialized tools to retrieve that data in any case. The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

To understand this a bit better: What can be the reasons a recent tree gets corrupted?

Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just happened to be in the middle of a tree commit.
2. Generic power loss during a tree commit.
3. A device not properly honoring write barriers (the operations immediately adjacent to the write barrier weren't being ordered correctly all the time).

Based on what I know about BTRFS, the following could also cause problems:
1. A single-event upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this happen to me, but know people who have).

In general, any of these will cause problems for pretty much any filesystem, not just BTRFS.

I always thought that with a controller, device, and driver combination that honors fsync, BTRFS would show either the new state or the last known good state *anyway*. So where does the need to roll back arise from?

I think that in this case the term rollback is a bit ambiguous; here it means from the point of view of userspace, which sees the FS as having 'rolled back' from the most recent state to the last known good state.

That said, all journalling filesystems have some sort of rollback as far as I understand: if the last journal entry is incomplete, they discard it on journal replay. So even there you lose the last seconds of write activity. But in case fsync() returns, the data needs to be safe on disk. I always thought BTRFS honors this under *any* circumstance. If some proposed autorollback breaks this guarantee, I think something is broken elsewhere. And fsync is an fsync is an fsync. Its semantics are clear as crystal. There is nothing, absolutely nothing to discuss about it. An fsync completes if the device itself reported "Yeah, I have the data on disk, all safe and cool to go." Anything else is a bug IMO.
Or a hardware issue: most filesystems need disks to properly honor write barriers to provide guaranteed semantics for fsync, and many consumer disk drives still don't honor them consistently.
Re: What is the vision for btrfs fs repair?
On Sun, Oct 12, 2014 at 6:14 AM, Martin Steigerwald mar...@lichtvoll.de wrote:

On Friday, 10 October 2014, 10:37:44 Chris Murphy wrote:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is better than a file system that won't mount without user intervention - one which requires a secret decoder ring to get it to mount at all. And it may require the use of specialized tools to retrieve that data in any case. The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

To understand this a bit better: What can be the reasons a recent tree gets corrupted? I always thought that with a controller, device, and driver combination that honors fsync, BTRFS would show either the new state or the last known good state *anyway*. So where does the need to roll back arise from?

In theory the recovery option should never be necessary. Btrfs makes all the guarantees everybody wants it to - when the data is fsynced, then it will never be lost. The question is what should happen when a corrupted tree root, which should never happen, happens anyway. The options are to refuse to mount the filesystem by default, or to mount it by default, discarding about 30-60s worth of writes. And yes, when this situation happens (whether it mounts by default or not), btrfs has broken its promise of data being written after a successful fsync return.

As has been pointed out, braindead drive firmware is the most likely cause of this sort of issue. However, there are a number of other hardware and software errors that could cause it, including errors in Linux outside of btrfs, and of course bugs in btrfs as well. In an ideal world no filesystem would need any kind of recovery/repair tools. They can often mean that the fsync promise was broken. The real question is, once that has happened, how do you move on?

I think the best default is to auto-recover, but to have better facilities for reporting errors to the user. Right now btrfs is very quiet about failures - maybe a cryptic message in dmesg, and nobody reads all of that unless they're looking for something. If btrfs could report significant issues, that might mitigate the impact of an auto-recovery. Also, another thing to consider during recovery is whether the damaged data could be optionally stored in a snapshot of some kind - maybe in the way that ext3/4 rollback data after conversion gets stored in a snapshot.
My knowledge of the underlying structures is weak, but I'd think that a corrupted tree root practically is a snapshot already, and turning it into one might even be easier than cleaning it up. Of course, we would need to ensure the snapshot could be deleted without further error. Doing anything with the snapshot might require special tools, but if people want to do disk scraping they could.

-- Rich
Re: What is the vision for btrfs fs repair?
On 10/08/2014 03:11 PM, Eric Sandeen wrote:

I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words - I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do? Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub: errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery: enable autorecovery attempts if a bad tree root is found at mount time.
* mount -o degraded: allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log: remove the log tree if the log tree is corrupt.
* btrfs rescue: recover a damaged btrfs filesystem (chunk-recover, super-recover). How does this relate to btrfs check?
* btrfs check: repair a btrfs filesystem (--repair, --init-csum-tree, --init-extent-tree). How does this relate to btrfs rescue?
* btrfs restore: try to salvage files from a damaged filesystem (not really repair, it's disk-scraping).

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options? It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?

We probably should just consolidate under 3 commands: one for online checking, one for offline repair, and one for pulling stuff off of the disk when things go to hell. A lot of these tools were born out of the fact that we didn't have a fsck tool for a long time, so there were these stop gaps put into place, and now it's time to go back and clean it up. I'll try and do this after I finish my cleanup/sync between kernel and progs work, and fill out the documentation a little better so it's clear when to use what. Thanks, Josef
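Until that consolidation happens, one way to keep the existing commands straight is to map them onto Josef's three roles. A sketch, with example device and mount point names (check your btrfs-progs manpages for the exact options your version supports):

    # 1. Online checking / preventive maintenance, on a mounted fs:
    btrfs scrub start /mnt
    # 2. Offline repair, on an unmounted fs - diagnose read-only first:
    btrfs check /dev/sdb1
    btrfs check --repair /dev/sdb1    # last resort; image/backup the device first
    # 3. Pulling data off when repair fails (disk scraping):
    btrfs restore /dev/sdb1 /srv/recovered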
Re: What is the vision for btrfs fs repair?
On Thursday, 9 October 2014, 21:58:53 you wrote:

* btrfs-zero-log: remove the log tree if the log tree is corrupt.
* btrfs rescue: recover a damaged btrfs filesystem (chunk-recover, super-recover). How does this relate to btrfs check?
* btrfs check: repair a btrfs filesystem (--repair, --init-csum-tree, --init-extent-tree). How does this relate to btrfs rescue?

These three translate into eight combinations of repairs; adding -o recovery, there are nine combinations. I think this is the main source of confusion: there are just too many options, but it's also completely non-obvious which one to use in which situation. My expectation is that eventually these get consolidated into just check and check --repair. As the repair code matures, it'd go into kernel autorecovery code. That's a guess on my part, but it's consistent with design goals.

Also, I think these should at least all be under the btrfs command. So include btrfs-zero-log in the btrfs command. And how about btrfs repair or btrfs check as an upper category, with at least the various options as commands below it? Then there is at least one command and one place in the manpage to learn about the various options. But maybe some can be made automatic as well, or folded into btrfs check --repair. Ideally it would auto-detect which path to take on filesystem recovery.

-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
Re: What is the vision for btrfs fs repair?
On Friday, 10 October 2014, 10:37:44 Chris Murphy wrote:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is better than a file system that won't mount without user intervention - one which requires a secret decoder ring to get it to mount at all. And it may require the use of specialized tools to retrieve that data in any case. The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

To understand this a bit better: What can be the reasons a recent tree gets corrupted? I always thought that with a controller, device, and driver combination that honors fsync, BTRFS would show either the new state or the last known good state *anyway*. So where does the need to roll back arise from?

That said, all journalling filesystems have some sort of rollback as far as I understand: if the last journal entry is incomplete, they discard it on journal replay. So even there you lose the last seconds of write activity. But in case fsync() returns, the data needs to be safe on disk. I always thought BTRFS honors this under *any* circumstance. If some proposed autorollback breaks this guarantee, I think something is broken elsewhere. And fsync is an fsync is an fsync. Its semantics are clear as crystal. There is nothing, absolutely nothing to discuss about it. An fsync completes if the device itself reported "Yeah, I have the data on disk, all safe and cool to go." Anything else is a bug IMO.

-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
Re: What is the vision for btrfs fs repair?
On Wednesday, 8 October 2014, 14:11:51 Eric Sandeen wrote:

I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words - I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do? Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub: errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery: enable autorecovery attempts if a bad tree root is found at mount time.
* mount -o degraded: allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log: remove the log tree if the log tree is corrupt.
* btrfs rescue: recover a damaged btrfs filesystem (chunk-recover, super-recover). How does this relate to btrfs check?
* btrfs check: repair a btrfs filesystem (--repair, --init-csum-tree, --init-extent-tree). How does this relate to btrfs rescue?
* btrfs restore: try to salvage files from a damaged filesystem (not really repair, it's disk-scraping).

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options? It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?

How about taking one step back: What are the possible corruption cases these tools are meant to address? *Where* can BTRFS break and *why*? What of it can be folded into one command? Where can BTRFS be improved to either prevent a corruption from happening or automatically correct it? What actions can be determined automatically by the repair tool? What needs to be options for the user to choose from? And what guidance would the user need to decide? I.e., really going back to what diagnosing and repairing BTRFS actually includes, and then, well… developing a vision for how this all can fit together, as you suggested.

As a minimum I suggest having all possible options as a main category in the btrfs command, with no external commands whatsoever; so if btrfs-zero-log is still needed, add it to the btrfs command.

-- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
Re: What is the vision for btrfs fs repair?
Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:

I always thought that with a controller, device, and driver combination that honors fsync, BTRFS would show either the new state or the last known good state *anyway*. So where does the need to roll back arise from?

My understanding here is...

With btrfs, a full-tree commit is atomic. You should get either the old tree or the new tree. However, due to the cascading nature of updates on cow-based structures, these full-tree commits are done by default (there's a mount option to adjust it) every 30 seconds. Between these atomic commits, partial updates may have occurred. The btrfs log (the one that btrfs-zero-log kills) is limited to between-commit updates, and thus to the up to 30 seconds (default) worth of changes since the last full-tree atomic commit.

In addition to that, there's a history of tree-root commits kept (with the superblocks pointing to the last one). btrfs-find-root can be used to list this history. The recovery mount option simply allows btrfs to fall back to this history, should the current root be corrupted. Btrfs restore can be used to list tree roots as well, and can be pointed at an appropriate one if necessary.

Fsync forces the file and its corresponding metadata update to the log and, barring hardware or software bugs, should not return until it's safely in the log, but I'm not sure whether it forces a full-tree commit. Either way the guarantees should be the same. If the log can be replayed or a full-tree commit has occurred since the fsync, the new copy should appear. If it can't, the rollback to the last atomic tree commit should return an intact copy of the file from that point. If the recovery mount option is used and a further rollback to an earlier full-tree commit is forced, then provided the file existed at the point of that full-tree commit, the intact file at that point should appear.

So if the current tree root is a good one, the log will replay the last up to 30 seconds of activity on top of that last atomic tree root. If the current root tree itself is corrupt, the recovery mount option will let an earlier one be used. Obviously in that case the log will be discarded, since it applies to a later root tree that itself has been discarded.

The debate is whether recovery should be automated so the admin doesn't have to care about it, or whether having to manually add that option serves as a necessary notifier to the admin that something /did/ go wrong, and that an earlier root is being used instead, so more than a few seconds worth of data may have disappeared.

As someone else has already suggested, I'd argue that as long as btrfs continues to be under the sort of development it's in now, keeping recovery as a non-default option is desired. Once it's optimized and considered stable, arguably recovery should be made the default, perhaps with a no-recovery option for those who prefer that in-the-face notification in the form of a mount error, if btrfs would otherwise fall back to an earlier tree root commit.

What worries me, however, is that IMO the recent warning stripping was premature. Btrfs is certainly NOT fully stable or optimized for normal use at this point. We're still using the even/odd PID balancing scheme for raid1 reads, for instance, and multi-device writes are still serialized when they could be parallelized to a much larger degree (tho keeping some serialization is arguably good for data safety).
Arguably, optimizing that now would be premature optimization, since the code itself is still subject to change, so I'm not complaining; but by that very same token, it *IS* still subject to change, which by definition means it's *NOT* stable, so why are we removing all the warnings and giving the impression that it IS stable? The decision wasn't mine to make, and I don't know; but while a nice suggestion, making recovery-by-default a measure of when btrfs goes stable simply won't work, because surely the same folks behind the warning stripping would then ensure this indicator, too, said btrfs was stable, while the state of the code itself continues to say otherwise.

Meanwhile, if your distributed-transactions scenario doesn't account for crash and loss of data on one side with real-time backup/redundancy, such that loss of a few seconds worth of transactions on a single local filesystem is going to kill the entire scenario, I don't think too much of that scenario in the first place; and regardless, btrfs, certainly in its current state, is definitely NOT an appropriate base for it. Use appropriate tools for the task. Btrfs, at least at this point, is simply not an appropriate tool for that task.
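To make the two mechanisms above concrete: the interval mentioned is the commit= mount option, and the fallback to an older tree root is the recovery mount option. A sketch with example device names (commit= needs a reasonably recent kernel; btrfs-find-root ships with btrfs-progs):

    # Shorten the window of at-risk writes from ~30s to ~15s:
    mount -o commit=15 /dev/sdb1 /mnt
    # Fall back to an older tree root if the current one is corrupt:
    mount -o recovery /dev/sdb1 /mnt
    # On an unmounted filesystem, list older tree roots still on disk:
    btrfs-find-root /dev/sdb1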
Re: What is the vision for btrfs fs repair?
On 10/10/2014 12:53 PM, Bob Marley wrote:

If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

I cannot agree. I consider it a sane default to have a consistent state with the recently written data lost, instead of requiring user intervention to avoid losing anything. To address your requirement, we would need a "super sync" command which ensures that the data are in the filesystem and not only in the log (as sync should ensure).

BR
-- Goffredo Baroncelli kreijackATinwind.it
Re: What is the vision for btrfs fs repair?
On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

Now if I can express wishes: I would like an option that spits out all the usable tree roots (or what's the name, superblocks?) and not just the newest one, which is corrupt. And then another option that lets me mount *readonly* starting from the tree root I specify, so I can check how much of the data is still there. After I decide that such a tree root is good, I need another option that lets me mount with that tree root in read-write mode, obviously eliminating all tree roots newer than it.

Some time ago I read that mounting the filesystem with an earlier tree root was possible, but only by manually erasing the disk regions in which the newer superblocks are. This is crazy: it's too risky on too many levels, and also, as I wrote, I want to check what data is available on a certain tree root before mounting read-write from that one.
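If I'm reading the existing tools right, something close to the first two wishes already exists in btrfs-progs, read-only and without touching the superblocks: btrfs-find-root lists candidate tree roots, and btrfs restore can be pointed at one of them (it never writes to the source device). A sketch with example names; <bytenr> stands for a root byte number printed by btrfs-find-root:

    # List older tree roots still present on the unmounted device:
    btrfs-find-root /dev/sdb1
    # Dry run: show what would be recoverable from a given root:
    btrfs restore -t <bytenr> -D /dev/sdb1 /tmp
    # Copy the files out for inspection; the source stays untouched:
    btrfs restore -t <bytenr> /dev/sdb1 /srv/recovered

The third wish - committing to an older root read-write and discarding the newer ones - still seems to have no safe tool, as noted above.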
Re: What is the vision for btrfs fs repair?
On Fri, 10 Oct 2014 12:53:38 +0200 Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side

What distributed transactions? Btrfs is not a clustered filesystem[1]; it does not support, and likely never will support, being mounted from multiple hosts at the same time.

[1] http://en.wikipedia.org/wiki/Clustered_file_system

-- With respect, Roman
Re: What is the vision for btrfs fs repair?
On 10/10/2014 12:59, Roman Mamedov wrote:

On Fri, 10 Oct 2014 12:53:38 +0200 Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side

What distributed transactions? Btrfs is not a clustered filesystem[1]; it does not support, and likely never will support, being mounted from multiple hosts at the same time.

[1] http://en.wikipedia.org/wiki/Clustered_file_system

This is not the only way to do a distributed transaction. Databases can be hosted on the filesystem, and those can do distributed transactions. Think of two bank accounts: one in a database on btrfs fs1 here, and another bank account in a database on whatever filesystem in another country. You want to debit one account and credit the other one: the filesystems at the two sides *must not roll back their state*!! (especially not transparently, without human intervention)
Re: What is the vision for btrfs fs repair?
On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is better than a file system that won't mount without user intervention - one which requires a secret decoder ring to get it to mount at all. And it may require the use of specialized tools to retrieve that data in any case. The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

Chris Murphy
Re: What is the vision for btrfs fs repair?
If -o recovery is necessary, then you're either running into a btrfs bug, or your hardware is lying about when it has actually written things to disk. The first case isn't unheard of, although far less common than it used to be, and it should continue to improve with time. In the second case, you're potentially screwed regardless of the filesystem, without doing hacks like waiting a good long time before returning from fsync in the hopes that the disk might actually have gotten around to performing the write it said had already finished.

On Fri, Oct 10, 2014 at 5:12 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 12:59, Roman Mamedov wrote:

On Fri, 10 Oct 2014 12:53:38 +0200 Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery: Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

No way! I wouldn't want a default like that. If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side

What distributed transactions? Btrfs is not a clustered filesystem[1]; it does not support, and likely never will support, being mounted from multiple hosts at the same time.

[1] http://en.wikipedia.org/wiki/Clustered_file_system

This is not the only way to do a distributed transaction. Databases can be hosted on the filesystem, and those can do distributed transactions. Think of two bank accounts: one in a database on btrfs fs1 here, and another bank account in a database on whatever filesystem in another country. You want to debit one account and credit the other one: the filesystems at the two sides *must not roll back their state*!! (especially not transparently, without human intervention)
Re: What is the vision for btrfs fs repair?
On 10/10/2014 16:37, Chris Murphy wrote:

The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions.

At most I can suggest that a flag in the metadata be added to allow/disallow auto-roll-back-on-error on such a filesystem, so people can decide between the tolerant vs. transaction-safe mode at filesystem creation.
Re: What is the vision for btrfs fs repair?
On 2014-10-10 19:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention.

A file system cannot do anything about the *DISKS* not honouring a sync command. That's what the PP was talking about.
Re: What is the vision for btrfs fs repair?
On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions.

If they need reliability, they should have some form of redundancy in place and/or run the database directly on the block device; because even ext4, XFS, and pretty much every other filesystem can lose data sometimes, the difference being that those tend to give worse results when hardware is misbehaving than BTRFS does, because BTRFS usually has an old copy of whatever data structure gets corrupted to fall back on. Also, you really shouldn't be running databases on a BTRFS filesystem at the moment anyway, because of the significant performance implications.

At most I can suggest that a flag in the metadata be added to allow/disallow auto-roll-back-on-error on such a filesystem, so people can decide between the tolerant vs. transaction-safe mode at filesystem creation.

The problem with this is that if the auto-recovery code did run (and IMHO the kernel should spit out a warning to the system log whenever it does), then chances are that you wouldn't have had a consistent view if you had prevented it from running either; and, if the database is properly distributed/replicated, then it should recover by itself.
Re: What is the vision for btrfs fs repair?
On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail-safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot.

A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions.

If they need reliability, they should have some form of redundancy in place and/or run the database directly on the block device; because even ext4, XFS, and pretty much every other filesystem can lose data sometimes,

Not if, i.e., fsync returns. If the data is gone later, it's a hardware problem, or occasionally a bug - bugs that are usually found and fixed pretty quickly.

the difference being that those tend to give worse results when hardware is misbehaving than BTRFS does, because BTRFS usually has an old copy of whatever data structure gets corrupted to fall back on.

I'm curious, is that based on conjecture or real-world testing?

-Eric
Re: What is the vision for btrfs fs repair?
On 2014-10-08 15:11, Eric Sandeen wrote:

I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words - I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do? Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub: errors are corrected along the way if possible (what *is* possible?)
* mount -o recovery: enable autorecovery attempts if a bad tree root is found at mount time.
* mount -o degraded: allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?)
* btrfs-zero-log: remove the log tree if the log tree is corrupt.
* btrfs rescue: recover a damaged btrfs filesystem (chunk-recover, super-recover). How does this relate to btrfs check?
* btrfs check: repair a btrfs filesystem (--repair, --init-csum-tree, --init-extent-tree). How does this relate to btrfs rescue?
* btrfs restore: try to salvage files from a damaged filesystem (not really repair, it's disk-scraping).

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options? It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?

Well, based on my understanding:

* btrfs scrub is intended to be almost exactly equivalent to scrubbing a RAID volume; that is, it fixes disparity between multiple copies of the same block. IOW, it isn't really repair per se, but more preventive maintenance. Currently, it only works for cases where you have multiple copies of a block (dup, raid1, and raid10 profiles), but support is planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it is more for dealing with metadata-related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage profile with fewer devices than the profile minimum. It's primarily so that you can get the fs into a state where you can run 'btrfs device replace'.
* btrfs-zero-log only deals with log tree corruption. This would be roughly equivalent to zeroing out the journal on an XFS or ext4 filesystem, and should almost never be needed.
* btrfs rescue is intended for low-level recovery of corruption on an offline fs.
* chunk-recover I'm not entirely sure about, but I believe it's like scrub for a single chunk on an offline fs.
* super-recover is for dealing with corrupted superblocks, and tries to replace them with one of the other copies (which hopefully isn't corrupted).
* btrfs check is intended to (eventually) be equivalent to the fsck utility for most other filesystems. Currently, it's relatively good at identifying corruption, but less so at actually fixing it. There are, however, some things that it won't catch, like a superblock pointing to a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in knowledge of the filesystem's on-disk structure, which makes it more reliable than more generic tools like scalpel for files that are too big to fit in the metadata blocks, and it is pretty much essential for dealing with transparently compressed files.

In general, my personal procedure for handling a misbehaving BTRFS filesystem is:

* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify what's wrong.
* Try mounting it using -o recovery.
* Try mounting it using -o ro,recovery.
* Use -o degraded only if it's a BTRFS raid set that lost a disk.
* If btrfs check AND dmesg both seem to indicate that the log tree is corrupt, try btrfs-zero-log.
* If btrfs check indicated a corrupt superblock, try btrfs rescue super-recover.
* If all of the above fails, ask for advice on the mailing list or IRC.

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.
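As a concrete example of "regularly": a simple weekly cron job is one common approach. A sketch, assuming the filesystem is mounted at /mnt (the path and cron location are examples; -B keeps scrub in the foreground so its summary ends up in cron's mail):

    #!/bin/sh
    # /etc/cron.weekly/btrfs-scrub (example path)
    btrfs scrub start -B /mnt
    # Print the final counters (corrected vs. uncorrectable errors):
    btrfs scrub status /mnt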
Re: What is the vision for btrfs fs repair?
Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it. Other than that detail, what you posted matches my knowledge and experience, such as it may be as a non-dev list regular, as well.
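Between scrubs, the per-device error counters btrfs keeps can be checked cheaply; they record the read, write, and corruption (checksum) errors seen so far. A sketch (the mount point is only an example):

    # Show cumulative per-device error counters for the filesystem:
    btrfs device stats /mnt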
Re: What is the vision for btrfs fs repair?
On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on first try.

Scrub checks both copies, though. It's ordinary reads that don't.

Hugo.

If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it. Other than that detail, what you posted matches my knowledge and experience, such as it may be as a non-dev list regular, as well.

-- Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk
Re: What is the vision for btrfs fs repair?
On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read. I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is. Also, that's a much better description of how multiple copies work than I could probably have ever given.
Re: What is the vision for btrfs fs repair?
On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read,

I'm fairly sure it does, as I've had it happen to me. :)

I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is.

If the FS is RO, then yes, it won't fix things.

Hugo.

-- Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk
Re: What is the vision for btrfs fs repair?
On 2014-10-09 08:12, Hugo Mills wrote:

On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on the first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, and even then, chances are 50% it'll pick the good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read,

I'm fairly sure it does, as I've had it happen to me. :)

I probably just misinterpreted the source code; while I know enough C to generally understand things, I'm by far no expert.

I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is.

If the FS is RO, then yes, it won't fix things.

Hugo.
Re: What is the vision for btrfs fs repair?
On Thu, 09 Oct 2014 08:07:51 -0400 Austin S Hemmelgarn ahferro...@gmail.com wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on the first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, and even then, chances are 50% it'll pick the good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read. I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is.

Definitely it won't with a read-only mount. But then scrub shouldn't be able to write to a read-only mount either. The only way a read-only mount should be writable is if it's mounted (bind-mounted or btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to that mount, not the read-only mounted location.

There's even debate about replaying the journal or doing orphan-delete on read-only mounts (at least on-media; the change could, and arguably should, occur in RAM and be cached, marking the cache dirty at the same time so it's appropriately flushed if/when the filesystem goes writable), with some arguing read-only means just that: don't write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy available) should be rewritten as far as I know. If not, I'd call it a bug. The problem is in the detection, not in the rewriting. Scrub's the only way to reliably detect these errors, since it's the only thing that systematically checks /everything/.

Also, that's a much better description of how multiple copies work than I could probably have ever given.

Thanks. =:^)

--
Duncan - No HTML messages please, as they are filtered as spam.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
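Incidentally, scrub has a read-only mode for exactly the situation being debated here, where you want the systematic detection without any rewriting at all; a minimal sketch (the mount point is a placeholder):

    # check every copy, report errors, but do not attempt any repairs
    btrfs scrub start -B -r /mnt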
Re: What is the vision for btrfs fs repair?
On Thu, 9 Oct 2014 12:55:50 +0100 Hugo Mills h...@carfax.org.uk wrote:

On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on the first try.

Scrub checks both copies, though. It's ordinary reads that don't.

While I believe I was clear in full context (see below), agreed. I was talking about normal reads in the above, not scrub, as the full quote should make clear. I guess I could have made it clearer in the immediate context, however. Thanks.

Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors.

--
Duncan - No HTML messages please, as they are filtered as spam.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
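On the question of whether ordinary reads have been bumping into bad copies at all, btrfs does keep per-device error counters that can be checked without running a scrub; a minimal sketch (the mount point is a placeholder):

    # cumulative read, write, and checksum error counts per device
    btrfs device stats /mnt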
Re: What is the vision for btrfs fs repair?
On 2014-10-09 08:34, Duncan wrote:

On Thu, 09 Oct 2014 08:07:51 -0400 Austin S Hemmelgarn ahferro...@gmail.com wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted:

Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.

AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on the first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one.

But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, and even then, chances are 50% it'll pick the good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them; it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read. I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is.

Definitely it won't with a read-only mount. But then scrub shouldn't be able to write to a read-only mount either. The only way a read-only mount should be writable is if it's mounted (bind-mounted or btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes

All in all, a BTRFS filesystem mounted ro is much more read-only than, say, ext4 (which at least updates the sb, and old versions replayed the journal, in addition to the atime updates).

There's even debate about replaying the journal or doing orphan-delete on read-only mounts (at least on-media; the change could, and arguably should, occur in RAM and be cached, marking the cache dirty at the same time so it's appropriately flushed if/when the filesystem goes writable), with some arguing read-only means just that: don't write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy available) should be rewritten as far as I know. If not, I'd call it a bug. The problem is in the detection, not in the rewriting. Scrub's the only way to reliably detect these errors, since it's the only thing that systematically checks /everything/.

Also, that's a much better description of how multiple copies work than I could probably have ever given.

Thanks. =:^)
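For anyone who wants to repeat Austin's read-only-mount experiment, the mount options under discussion look like this in practice (device and mount point are placeholders):

    # mount read-only, with atime updates explicitly off as well
    mount -o ro,noatime /dev/sdb1 /mnt

    # later, flip the same mount to read-write in place
    mount -o remount,rw /mnt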
Re: What is the vision for btrfs fs repair?
Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as excerpted:

On 2014-10-09 08:34, Duncan wrote: The only way a read-only mount should be writable is if it's mounted (bind-mounted or btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime

I've been mounting noatime for well over a decade now, exactly due to such problems. But I believe at least /some/ filesystems are truly read-only when they're mounted as such, and atime updates don't happen on them. These days I actually apply a patch that changes the default relatime to noatime, so I don't even have to have it in my mount options. =:^)

* The superblock gets updated if there are 'any' writes

Yeah. At least in theory, there shouldn't be, however. As I said, in theory, even journal replay and orphan delete shouldn't hit media, altho handling it in memory and dirtying the cache, so that if the filesystem is ever remounted read-write they get written, is reasonable.

* The free space cache 'might' be updated if there are any writes

Makes sense. But of course that's what I'm arguing: there shouldn't /be/ any writes. Read-only should mean exactly that, don't touch media, period.

I remember at one point activating an mdraid1 degraded, read-only, just a single device of the 4-way raid1 I was running at the time, to recover data from it after the system it was running in died. The idea was don't write to the device at all, because I was still testing the new system, and in case I decided to try to reassemble the raid at some point. Read-only really NEEDS to be read-only under such conditions.

Similarly for forensic examination, of course. If there's a write, any write, it's evidence tampering. Read-only needs to MEAN read-only!

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
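When read-only absolutely has to mean read-only, as in Duncan's recovery and forensics scenarios, one belt-and-braces option is to set the flag at the block layer before any filesystem or md code ever touches the device; a sketch, with the device name as a placeholder:

    # mark the whole device read-only in the kernel; any write attempt
    # through the block layer is refused until the flag is cleared
    blockdev --setro /dev/sdc
    blockdev --getro /dev/sdc    # prints 1 while the flag is set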
Re: What is the vision for btrfs fs repair?
On 10/9/14 8:49 AM, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as excerpted:

On 2014-10-09 08:34, Duncan wrote: The only way a read-only mount should be writable is if it's mounted (bind-mounted or btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime

Getting off the topic a bit, but that really shouldn't happen:

    #define IS_NOATIME(inode)   __IS_FLG(inode, MS_RDONLY|MS_NOATIME)

and in touch_atime():

    if (IS_NOATIME(inode))
        return;

-Eric
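That is, MS_RDONLY by itself is already enough to suppress atime updates, which is easy to confirm from userspace; a quick sketch (device and file paths are placeholders):

    mount -o ro /dev/sdb1 /mnt
    stat -c '%x' /mnt/somefile     # note the access time
    cat /mnt/somefile > /dev/null
    stat -c '%x' /mnt/somefile     # unchanged: rdonly implies no atime update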
Re: What is the vision for btrfs fs repair?
On Oct 8, 2014, at 3:11 PM, Eric Sandeen sand...@redhat.com wrote:

I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem.

It's definitely confusing compared to any other filesystem I've used on four different platforms. And that's when excluding disk-scraping and the functions unique to any multiple-device volume: scrubs and degraded mounts. To be fair, mdadm doesn't even have a scrub command; it's done via 'echo check > /sys/block/mdX/md/sync_action'. And meanwhile LVM has pvck, vgck, and for scrubs it's lvchange --syncaction {check|repair}. These are also completely non-obvious.

* mount -o recovery
  Enable autorecovery attempts if a bad tree root is found at mount time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

* btrfs-zero-log
  Remove the log tree if the log tree is corrupt.

* btrfs rescue
  Recover a damaged btrfs filesystem: chunk-recover, super-recover. How does this relate to btrfs check?

* btrfs check
  Repair a btrfs filesystem: --repair, --init-csum-tree, --init-extent-tree. How does this relate to btrfs rescue?

These three check options translate into eight combinations of repairs; adding -o recovery makes nine. I think this is the main source of confusion: there are just too many options, and it's completely non-obvious which one to use in which situation. My expectation is that eventually these get consolidated into just check and check --repair. As the repair code matures, it'd go into the kernel autorecovery code. That's a guess on my part, but it's consistent with design goals.

It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?

I suspect it's unintended splintering, and is an artifact that will go away. I'd rather the convoluted, fractured nature of repair go away before the scary experimental warnings do.

Chris Murphy
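For reference, the md and LVM scrub invocations Chris mentions look like this in runnable form (array and volume names are examples):

    # md: start a check pass, then read back the mismatch count
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt

    # LVM RAID: the equivalent via lvchange
    lvchange --syncaction check vg0/lv_home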
Re: What is the vision for btrfs fs repair?
Chris Murphy posted on Thu, 09 Oct 2014 21:58:53 -0400 as excerpted:

I suspect it's unintended splintering, and is an artifact that will go away. I'd rather the convoluted, fractured nature of repair go away before the scary experimental warnings do.

Heh, agreed with everything[1], but it's too late for this: the experimental warnings are peeled off, while the experimental, or at least horribly immature, /behavior/ remains. =:^(

---
[1] ... and a much more logically cohesive and well-structured reply than I could have managed, as my own thoughts simply weren't that well organized.

--
Duncan - List replies preferred. No HTML msgs.
Every nonfree program has a lord, a master -- and if you use the program, he is your master.
Richard Stallman
What is the vision for btrfs fs repair?
I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem.

In other words: I'm an admin cruising along when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do?

Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own:

* btrfs scrub
  Errors are corrected along the way if possible (what *is* possible?)

* mount -o recovery
  Enable autorecovery attempts if a bad tree root is found at mount time.

* mount -o degraded
  Allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?)

* btrfs-zero-log
  Remove the log tree if the log tree is corrupt.

* btrfs rescue
  Recover a damaged btrfs filesystem: chunk-recover, super-recover. How does this relate to btrfs check?

* btrfs check
  Repair a btrfs filesystem: --repair, --init-csum-tree, --init-extent-tree. How does this relate to btrfs rescue?

* btrfs restore
  Try to salvage files from a damaged filesystem (not really repair; it's disk-scraping).

What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options?

It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?

Thanks,
-Eric
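To make the moving parts above concrete, here is one plausible escalation order through those tools. This is a sketch under the assumption of a single-device filesystem on /dev/sdb1 (device and destination paths are placeholders), not an official procedure; in particular, --repair and the --init-* options are best run only against a filesystem that is already backed up or written off:

    # 1. try the built-in fallback to an older tree root
    mount -o recovery /dev/sdb1 /mnt

    # 2. offline: repair the superblock from one of its spare copies
    btrfs rescue super-recover /dev/sdb1

    # 3. only if mount fails while replaying the log tree: discard it
    btrfs-zero-log /dev/sdb1

    # 4. read-only offline check first, then repair as a last resort
    btrfs check /dev/sdb1
    btrfs check --repair /dev/sdb1

    # 5. if nothing mounts, scrape files out to another filesystem
    btrfs restore /dev/sdb1 /srv/recovered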