Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-30 02:33, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as excerpted: [...]
>
> [Usual backups rant, user vs. admin variant, nocow/tmpfs edition. Regulars can skip as the rest is already predicted from past posts, for them. =;^]
>
> "Regular user"? "Regular users" don't need to bother with this level of detail. They simply get their "admin" to do it, even if that "admin" is their kid, or the kid from next door that's good with computers, or the geek squad (aka nsa-agent-squad) guy/gal, doing it... or telling them to install "a real OS", meaning whatever MS/Apple/Google something that they know how to deal with. If the "user" is dealing with setting nocow, choosing btrfs in the first place, etc, then they're _not_ a "regular user" by definition, they're already an admin.

I'd argue that that's not always true. "Regular users" also blindly follow advice they find online about how to make their system run better, and quite often don't keep backups.

> And as any admin learns rather quickly, the value of data is defined by the number of backups it's worth having of that data. [...]
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as excerpted:

> On 2018-06-29 13:58, james harvey wrote:
>> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote:
>>> On 2018-06-29 11:15, james harvey wrote:
>>>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>>>>> And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?
>>>>
>>>> Scrub never looks at nodatacow files. It does not compare the copies to each other.
>>>>
>>>> Qu submitted a patch to make check compare the copies:
>>>> https://patchwork.kernel.org/patch/10434509/
>>>>
>>>> This hasn't been added to btrfs-progs git yet.
>>>>
>>>> IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it.
>>>
>>> That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand.
>>
>> Absolutely. User would need to use manual intervention as you describe, or restore the single file(s) from backup. But, it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read.
>
> The thing is though, as things stand right now, you need to manually edit the data on-disk directly or restore the file from a backup to fix the file. While it's technically true that you can manually repair this type of thing, in both of the cases for doing it without those patches I mentioned it's functionally impossible for a regular user to do it without potentially losing some data.

[Usual backups rant, user vs. admin variant, nocow/tmpfs edition. Regulars can skip as the rest is already predicted from past posts, for them. =;^]

"Regular user"? "Regular users" don't need to bother with this level of detail. They simply get their "admin" to do it, even if that "admin" is their kid, or the kid from next door that's good with computers, or the geek squad (aka nsa-agent-squad) guy/gal, doing it... or telling them to install "a real OS", meaning whatever MS/Apple/Google something that they know how to deal with. If the "user" is dealing with setting nocow, choosing btrfs in the first place, etc, then they're _not_ a "regular user" by definition, they're already an admin.

And as any admin learns rather quickly, the value of data is defined by the number of backups it's worth having of that data. Which means it's not a problem. Either the data had a backup and it's (reasonably) trivial to restore the data from that backup, or the data was defined by lack of having that backup as of only trivial value, so low as to not be worth the time/trouble/resources necessary to make that backup in the first place.
Which of course means what was defined as of most value, either the data if there was a backup, or the time/trouble/resources that would have gone into creating it if not, is *always* saved. (And of course the same goes for "I had a backup, but it's old", except in this case it's the value of the data delta between the backup and current. As soon as it's worth more than the time/trouble/hassle of updating the backup, it will by definition be updated. Not having a newer backup available thus simply means the value of the data that changed between the last backup and current was simply not enough to justify updating the backup, and again, what was of most value is *always* saved, either the data, or the time that would have otherwise gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/ to know it, if there's anything an admin knows *well* it's that the working copy of data **WILL** be damaged. It's not a matter of if, but of when, and of whether it'll be a fat-finger mistake, or a hardware or software failure, or malware (theft, ransomware, etc), or weather (flood, fire and the water that put it out, etc), tho none of that actually matters after all, because in the end, the only thing that matters was how the value of that data was defined by the number of backups made of it, and how quickly and conveniently at least one of those backups could be restored.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Fri, Jun 29, 2018 at 9:15 AM, james harvey wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>> And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?
>
> Scrub never looks at nodatacow files. It does not compare the copies to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/

Yeah, online scrub needs to report any mismatches, even if it can't do anything about it because it's ambiguous which copy is wrong.

> IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it.

I don't mind this being implemented in offline scrub first for testing purposes. But the online scrub certainly should have this ability eventually.

--
Chris Murphy
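[Editorial illustration] To make "compare the copies to each other" concrete, here is a minimal sketch in Python of the report-only comparison being discussed. It is not Qu's actual btrfs-progs patch: it assumes you already have the same file's contents as read from each mirror (the paths and the 4 KiB block size are assumptions), and it only reports mismatches, since without generation tracking there is no safe way to pick a winner.

```python
#!/usr/bin/env python3
# Minimal sketch of the "compare the copies" idea for nodatacow data:
# with no csum, the best a checker can do is flag blocks where the two
# mirror copies disagree -- it cannot tell which copy is the good one.
# path_a / path_b are assumed to hold the same file's contents as read
# from each mirror device (extraction is out of scope here).
import sys

BLOCK = 4096  # btrfs data block size on most systems

def compare_copies(path_a: str, path_b: str) -> list[int]:
    """Return the indices of blocks that differ between the two copies."""
    mismatches = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            block_a = a.read(BLOCK)
            block_b = b.read(BLOCK)
            if not block_a and not block_b:
                break
            if block_a != block_b:
                mismatches.append(index)
            index += 1
    return mismatches

if __name__ == "__main__":
    bad = compare_copies(sys.argv[1], sys.argv[2])
    for i in bad:
        print(f"mismatch at byte offset {i * BLOCK}")
    # Like the proposed check patch, this only reports: it cannot
    # decide automatically which copy to trust.
    sys.exit(1 if bad else 0)
```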
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 13:58, james harvey wrote:
> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote:
>> On 2018-06-29 11:15, james harvey wrote:
>>> [...]
>>
>> That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand.
>
> Absolutely. User would need to use manual intervention as you describe, or restore the single file(s) from backup. But, it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read.

The thing is though, as things stand right now, you need to manually edit the data on-disk directly or restore the file from a backup to fix the file. While it's technically true that you can manually repair this type of thing, in both of the cases for doing it without those patches I mentioned it's functionally impossible for a regular user to do it without potentially losing some data. Unless that changes, scrub telling you it's corrupt is not going to help much aside from making sure you don't make things worse by trying to use it.

Given this, it would make sense to have a (disabled by default) option to have scrub repair it by just using the newer or older copy of the data. That would require classic RAID generational tracking though, which BTRFS doesn't have yet.

A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file).
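[Editorial illustration] As a rough sketch of that recovery flow, in Python: the per-device read mount option used here is HYPOTHETICAL (Anand's patches were never merged, and the option name is made up purely for illustration). The read-only mounts and the drop_caches step are standard Linux; the paths and device IDs are assumptions.

```python
#!/usr/bin/env python3
# Sketch of the manual recovery flow described above. The read_mirror
# mount option is HYPOTHETICAL -- no such option exists in mainline;
# it stands in for Anand Jain's unmerged per-device read selection.
import filecmp
import shutil
import subprocess

VOL = "/dev/sdb1"        # any member device of the volume (assumption)
MNT = "/mnt/recover"     # scratch mount point (assumption)
FILE = "disk.vdi"        # the damaged nodatacow file (assumption)

def run(*cmd: str):
    subprocess.run(cmd, check=True)

def pull_copy(devid: int, dest: str):
    # Mount read-only, pinning all reads to one mirror (hypothetical).
    run("mount", "-o", f"ro,read_mirror={devid}", VOL, MNT)
    shutil.copyfile(f"{MNT}/{FILE}", dest)
    run("umount", MNT)
    # Drop the page cache so the next mount really re-reads the disk.
    with open("/proc/sys/vm/drop_caches", "w") as caches:
        caches.write("3\n")

pull_copy(1, "/tmp/copy.devid1")
pull_copy(2, "/tmp/copy.devid2")

if filecmp.cmp("/tmp/copy.devid1", "/tmp/copy.devid2", shallow=False):
    print("copies agree; nothing to repair")
else:
    # Reconciliation is the manual part: inspect both copies, pick or
    # merge one, then mount writable and rewrite the *whole* file so
    # both mirrors end up with the repaired contents.
    print("copies differ; reconcile by hand, then for example:")
    print(f"  mount {VOL} {MNT} && cp chosen.vdi {MNT}/{FILE}")
```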
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn wrote:
> On 2018-06-29 11:15, james harvey wrote:
>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>>> And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?
>>
>> Scrub never looks at nodatacow files. It does not compare the copies to each other.
>>
>> Qu submitted a patch to make check compare the copies:
>> https://patchwork.kernel.org/patch/10434509/
>>
>> This hasn't been added to btrfs-progs git yet.
>>
>> IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it.
>
> That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand.

Absolutely. User would need to use manual intervention as you describe, or restore the single file(s) from backup. But, it's a good opportunity to tell the user they had partial data corruption, even if it can't be auto-fixed. Otherwise they get intermittent data corruption, depending on which copies are read.

> A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file).
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 11:15, james harvey wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>> And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?
>
> Scrub never looks at nodatacow files. It does not compare the copies to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/
>
> This hasn't been added to btrfs-progs git yet.
>
> IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it.

That said, it can't sanely fix things if there is a mismatch. At least, not unless BTRFS gets proper generational tracking to handle temporarily missing devices. As of right now, sanely fixing things requires significant manual intervention, as you have to bypass the device read selection algorithm to be able to look at the state of the individual copies so that you can pick one to use and forcibly rewrite the whole file by hand.

A while back, Anand Jain posted some patches that would let you select a particular device to direct all reads to via a mount option, but I don't think they ever got merged. That would have made manual recovery in cases like this exponentially easier (mount read-only with one device selected, copy the file out somewhere, remount read-only with the other device, drop caches, copy the file out again, compare and reconcile the two copies, then remount the volume writable and write out the repaired file).
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
> And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?

Scrub never looks at nodatacow files. It does not compare the copies to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like this, but I still think this also needs to be added to scrub. In the patch thread, I discuss my reasons why. In brief: online scanning; this goes along with user's expectation of scrub ensuring mirrored data integrity; and recommendations to setup scrub on periodic basis to me means it's the place to put it.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-29 01:10, Andrei Borzenkov wrote:
> On 28.06.2018 12:15, Qu Wenruo wrote:
>> On 2018-06-28 16:16, Andrei Borzenkov wrote:
>>> [...]
>>> When one drive fails, it is recorded in meta-data on remaining drives; probably the configuration generation number is increased. Next time the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we could start from detecting and refusing to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, a hardware RAID controller seems to be the only solution as other software may have the same problem.
>>
>> And the hardware solution looks pretty interesting: is the write to NVRAM 100% atomic? Even at power loss?
>>
>>> Why should it not work as long as any write to the array is suspended until the superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in the device superblocks?
>
> Well, you use "generation" in the strict btrfs sense, I use "generation" generically. That is exactly what btrfs apparently lacks currently - some monotonic counter that is used to record such an event.

Indeed, btrfs doesn't have any way to record which device got degraded at all. The usage of btrfs device generation is already a kind of workaround.

So to keep the same behavior as mdraid/lvm, each time btrfs detects a missing device or a fatal command (flush/fua) not executed correctly, btrfs needs to record it, maybe into its device item, and commit it to disk.

In short, the btrfs csum makes us a little conceited about such device-missing cases: normally csum will tell us which data is wrong, so we could avoid complex device status tracking. But apparently, if nodatasum is involved, everything just goes out of our expectation.

>> If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached the super block update, there is no difference in the device superblocks, thus no way to detect which is correct.
>
> Again, the very fact that the device failed should have triggered an update of the superblock to record this information, which presumably should increase some counter.

Indeed.

>>>> If you're talking about the missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the split-brain case).
>>>
>>> That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.
>>
>> While for the VDI case (or any VM image file format other than raw), older valid data normally means corruption, unless they have their own write-ahead log. Some file formats may detect such problems by themselves if they have an internal checksum, but anyway, older data normally means corruption, especially when partially new and partially old.
>
> Yes, that's true. But there is really nothing that can be done here, even theoretically; it is hardly a reason to not do what looks possible.

Well, theoretically, you can just use datasum and datacow :)

Thanks,
Qu
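[Editorial illustration] The event-count scheme Andrei describes, and which this message says btrfs would need to record in its device items, is easy to model. A toy sketch in plain Python (no btrfs internals; names are made up) of the bookkeeping that mdraid-style arrays do:

```python
#!/usr/bin/env python3
# Toy model of mdadm-style event counting: before any degraded write
# proceeds, the surviving members record the event, so a returning
# stale member can be recognized at assembly time. Pure simulation.

class Member:
    def __init__(self, name: str):
        self.name = name
        self.events = 0        # monotonic counter in the superblock
        self.present = True

def degraded_write(members: list[Member]):
    # Rule: suspend writes until every *remaining* superblock has
    # recorded the failure, then proceed with the write.
    for m in members:
        if m.present:
            m.events += 1

def assemble(members: list[Member]):
    newest = max(m.events for m in members)
    for m in members:
        if m.events < newest:
            # Stale member: do not serve reads from it; resync it.
            print(f"{m.name}: events {m.events} < {newest}, needs resync")
        else:
            print(f"{m.name}: up to date (events {m.events})")

a, b = Member("devA"), Member("devB")
b.present = False          # devB drops out mid-flight
degraded_write([a, b])     # only devA records the event
b.present = True           # devB reappears on next boot
assemble([a, b])           # devB is detected as stale, not trusted
```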
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli wrote:
> Regarding your point 3), it must be pointed out that in the case of NOCOW files, even having the same transid is not enough. It is still possible that a copy is updated before a power failure prevents the super-block update.
> I think that the only way to prevent it from happening is:
> 1) using a data journal (which means that each data block is copied two times)
> OR
> 2) using a cow filesystem (with cow enabled, of course!)

There is no power failure in this example. So it's really off the table when considering whether Btrfs or mdadm/lvm raid do better in the same situation with a nodatacow file.

I think here is the problem in the Btrfs nodatacow case. Btrfs doesn't have a way of untrusting nodatacow files on a previously missing drive that hasn't been balanced. There is no such thing as nometadatacow, so no matter what, it figures out there's a problem and uses the good copy of metadata, but it never "marks" the previously missing device as suspicious. When it comes time to read a nodatacow file, Btrfs just blindly reads off one of the drives; it has no mechanism for questioning the formerly missing drive without csum.

That is actually a really weird and unique kind of write hole for Btrfs raid1 when the data is nodatacow. I have to agree with Remi. This is a flaw in the design or a bad bug, however you want to consider it. Because mdadm/lvm do not behave this way in the exact same situation.

And an open question I have about scrub is whether it only ever is checking csums, meaning nodatacow files are never scrubbed, or if the copies are at least compared to each other?

As for fixes:

- During mount time, Btrfs sees from the supers that there is a transid mismatch, so it should not read nodatacow files from the lower transid device until an auto balance has completed (see the sketch after this message). Right now Btrfs doesn't have an abbreviated balance that "replays" the events between two transids. Basically it would work like send/receive, but for balance, to catch up a previously missing device. Right now we have to do a full balance, which is a brutal penalty for a briefly missing drive. Again, mdadm and lvm do better here by default.

- Fix the performance issues of COW with disk images. ZFS doesn't even have a nodatacow option and they're running VM images on ZFS, and it doesn't sound like they're running into ridiculous performance penalties that make it impractical to use.

--
Chris Murphy
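[Editorial illustration] A sketch of what that first fix could look like as a read-selection rule. Illustrative Python only: real mirror selection happens in the kernel, and the device names and generation numbers here are made up.

```python
#!/usr/bin/env python3
# Sketch of transid-aware read selection: if the member superblocks
# disagree on generation at mount time, data that carries no csum
# should only be read from the newest member, because nothing
# downstream can catch a stale copy after the read.

def pick_mirror(generations: dict[str, int], has_csum: bool) -> str:
    """generations maps device name -> superblock generation."""
    newest = max(generations.values())
    if has_csum:
        # csum'd data can come from any mirror: a stale or corrupt
        # copy fails csum on read and gets repaired from the other.
        return next(iter(generations))
    # nodatacow data: trust only members at the newest generation.
    for dev, gen in generations.items():
        if gen == newest:
            return dev
    raise AssertionError("unreachable: max() came from this dict")

supers = {"sda": 10542, "sdb": 10498}        # sdb was briefly missing
print(pick_mirror(supers, has_csum=True))    # either mirror is fine
print(pick_mirror(supers, has_csum=False))   # must be sda
```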
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 9:37 AM, Remi Gauvin wrote:
> On 2018-06-28 10:17 AM, Chris Murphy wrote:
>> 2. The new data goes in a single chunk; even if the user does a manual balance (resync) their data isn't replicated. They must know to do a -dconvert balance to replicate the new data. Again this is a net worse behavior than mdadm out of the box, putting user data at risk.
>
> I'm not sure this is the case. Even though writes failed to the disconnected device, btrfs seemed to keep on going as though it *were*.

Yeah, in your case the failure happens during normal operation, and in that case there's no degraded state on Btrfs. So it keeps writing to the raid1 chunk on the working drive, with writes on the failed device going nowhere (with lots of write errors). When you stop using the volume, fix the problem with the missing drive, then remount the volume, it really should still use only the new copy on the never-missing drive, even though it won't necessarily notice the file is missing on the formerly missing drive. You have to balance manually to fix it.

> When the array was re-mounted with both devices (never mounted as degraded), and scrub was run, scrub took a *long* time fixing errors, at a whopping 3MB/s, and reported having fixed millions of them.

That's slow, but it's expected to fix a lot of problems. Even in a very short amount of time there are thousands of missing data and metadata extents that need to be replicated.

--
Chris Murphy
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
> Acceptable, but not really apply to software based RAID1.

Which completely disregards the minor detail that all the software RAIDs I know of can handle exactly this kind of situation without losing or corrupting a single byte of data (errors on the remaining hard drive notwithstanding). Exactly what methods they employ to do so I'm not an expert at, but it *does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote. I will, however, propose a different solution. Given the reliance of BTRFS on csum, and the lack of any resynchronization (no matter how the drives got out of sync), I think NoDataCow should just be ignored in the case of RAID, just like the data blocks would get copied if there was a snapshot. In the current implementation of RAID on btrfs, RAID and nodatacow are effectively mutually exclusive.

Consider the kinds of use cases nodatacow is usually recommended for: VM images and databases. Even though those files should have their own mechanisms for dealing with incomplete writes and data verification, BTRFS RAID creates a unique situation where parts of the file can be inconsistent, with different data being read depending on which device is doing the reading.

Regardless of which method, short term and long term, developers choose to address this, this next part I have to stress I consider very important. The status page really needs to be updated to reflect this gotcha. It *will* bite people in ways they do not expect, and disastrously.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual balance (resync) their data isn't replicated. They must know to do a -dconvert balance to replicate the new data. Again this is a net worse behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different transid, Btrfs won't always pick the higher transid copy? If true that's terrible, and again not at all what mdadm/lvm are doing.

All of these could be avoided simply by not allowing a multi-device filesystem to mount without ensuring that all the devices have the same generation. In the past I proposed a mount.btrfs helper; I am still thinking that it would be the right place to a) put all the checks before mounting the filesystem, and b) print the correct information in order to help the user with what he has to do to solve the issues.

Regarding your point 3), it must be pointed out that in the case of NOCOW files, even having the same transid is not enough. It is still possible that a copy is updated before a power failure prevents the super-block update. I think that the only way to prevent it from happening is: 1) using a data journal (which means that each data block is copied two times), OR 2) using a cow filesystem (with cow enabled, of course!).

I think that this is a good example of why a battery-backed HW RAID controller could be better than SW raid. Of course the likelihood of a lot of these problems could be reduced using a UPS.

BR
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
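[Editorial illustration] Goffredo's point about NOCOW and power failure can be shown with a toy timeline in Python. Nothing here is btrfs-specific; "gen" simply stands in for the superblock generation, and the values are made up.

```python
#!/usr/bin/env python3
# Toy timeline: with nodatacow, data is overwritten in place *before*
# the superblocks commit, so a crash between the two device writes
# leaves the mirrors divergent while every superblock still carries
# the same generation -- nothing on disk can flag the divergence.

mirror = {"devA": {"gen": 100, "data": "old"},
          "devB": {"gen": 100, "data": "old"}}

# Overwrite in place (no COW): devA receives the new bytes first...
mirror["devA"]["data"] = "new"
# ...and the machine loses power before devB is written and before
# the commit that would have bumped the generation on either device.

assert mirror["devA"]["gen"] == mirror["devB"]["gen"]    # both gen 100
assert mirror["devA"]["data"] != mirror["devB"]["data"]  # but divergent
print("same generation, different data: only a data journal or COW"
      " (or per-block csums) could detect or prevent this")
```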
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 28.06.2018 12:15, Qu Wenruo wrote:
> On 2018-06-28 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote:
>>> [...]
>>> WOW, that's black magic, at least for RAID1. The whole RAID1 has no idea of which copy is correct, unlike btrfs, which has datasum.
>>>
>>> Don't bother with other things, just tell me how to determine which one is correct?
>>
>> When one drive fails, it is recorded in meta-data on remaining drives; probably the configuration generation number is increased. Next time the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks.
>
> Yep, the only possible way to determine such case is from external info.
>
> For device generation, it's possible to enhance btrfs, but at least we could start from detecting and refusing to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, a hardware RAID controller seems to be the only solution as other software may have the same problem.
>
> And the hardware solution looks pretty interesting: is the write to NVRAM 100% atomic? Even at power loss?
>
>>> The only possibility is that the misbehaving device missed several super block updates so we have a chance to detect it's out-of-date. But that's not always working.
>>
>> Why should it not work as long as any write to the array is suspended until the superblock on remaining devices is updated?
>
> What happens if there is no generation gap in the device superblocks?

Well, you use "generation" in the strict btrfs sense, I use "generation" generically. That is exactly what btrfs apparently lacks currently - some monotonic counter that is used to record such an event.

> If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached the super block update, there is no difference in the device superblocks, thus no way to detect which is correct.

Again, the very fact that the device failed should have triggered an update of the superblock to record this information, which presumably should increase some counter.

>>> If you're talking about the missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the split-brain case).
>>
>> That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.
>
> While for the VDI case (or any VM image file format other than raw), older valid data normally means corruption, unless they have their own write-ahead log. Some file formats may detect such problems by themselves if they have an internal checksum, but anyway, older data normally means corruption, especially when partially new and partially old.

Yes, that's true. But there is really nothing that can be done here, even theoretically; it is hardly a reason to not do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for a single device). So the title, especially the "major design flaw", couldn't be more wrong.
> [...]
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 10:17 AM, Chris Murphy wrote:
> 2. The new data goes in a single chunk; even if the user does a manual balance (resync) their data isn't replicated. They must know to do a -dconvert balance to replicate the new data. Again this is a net worse behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case. Even though writes failed to the disconnected device, btrfs seemed to keep on going as though it *were*.

When the array was re-mounted with both devices (never mounted as degraded), and scrub was run, scrub took a *long* time fixing errors, at a whopping 3MB/s, and reported having fixed millions of them.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
The problems are known with Btrfs raid1, but I think they bear repeating because they are really not OK. In the exact same described scenario (a simple clear-cut drop-off of a member device, which then later clearly reappears; no transient failure), both mdadm and LVM based raid1 would have re-added the missing device and resynced it, because an internal bitmap is the default (on >100G arrays for mdadm, and always with lvm). Only the new data would be propagated to user space. Both mdadm and lvm have metadata to know which drive has stale data in this common scenario.

Btrfs does two, maybe three, bad things:
1. No automatic resync. This is a net worse behavior than mdadm and lvm, putting data at risk.
2. The new data goes in a single chunk; even if the user does a manual balance (resync) their data isn't replicated. They must know to do a -dconvert balance to replicate the new data. Again this is a net worse behavior than mdadm out of the box, putting user data at risk.
3. Apparently if nodatacow, given a file with two copies of different transid, Btrfs won't always pick the higher transid copy? If true that's terrible, and again not at all what mdadm/lvm are doing.

Btrfs can do better because it has more information available to make unambiguous decisions about data. But it needs to always do at least as good a job as mdadm/lvm, and as reported, that didn't happen. So some testing is needed, in particular of case #3 above with nodatacow. That's a huge bug, if it's true.

Chris Murphy
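[Editorial illustration] Since testing of case #3 is the open item, here is a hedged reproducer sketch in Python driving standard tools (truncate, losetup, mkfs.btrfs, mount, chattr, dd). It approximates the "temporarily dropped device" with a degraded mount on loop devices, which is a variant of the reported scenario, not an exact replay. Run as root in an empty scratch directory; file names and sizes are arbitrary, and kernel behavior (e.g., whether the degraded rw mount is permitted) varies by version.

```python
#!/usr/bin/env python3
# Reproducer sketch for case #3: does btrfs hand back stale nodatacow
# data from a formerly missing raid1 member? A test sketch, not a
# test suite; everything here is destructive only to its own images.
import os
import subprocess

def sh(cmd: str) -> str:
    out = subprocess.run(cmd, shell=True, check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

os.makedirs("mnt", exist_ok=True)
sh("truncate -s 1G a.img b.img")
dev_a = sh("losetup -f --show a.img")
dev_b = sh("losetup -f --show b.img")
sh(f"mkfs.btrfs -f -d raid1 -m raid1 {dev_a} {dev_b}")
sh("btrfs device scan")

sh(f"mount {dev_a} mnt")
sh("touch mnt/blob && chattr +C mnt/blob")  # nodatacow before 1st write
sh("dd if=/dev/zero of=mnt/blob bs=1M count=64 conv=notrunc")
sh("umount mnt")

sh(f"losetup -d {dev_b}")                   # the member "drops"
sh(f"mount -o degraded {dev_a} mnt")
sh("dd if=/dev/urandom of=mnt/blob bs=1M count=64 conv=notrunc")
sh("umount mnt")

dev_b = sh("losetup -f --show b.img")       # the member "reappears"
sh("btrfs device scan")
sh(f"mount {dev_a} mnt")

# Read repeatedly, dropping caches in between: if mirror selection
# (PID-based on kernels of this era) picks the stale member, the
# checksum of the reads will flip between old and new contents.
for _ in range(4):
    print(sh("md5sum mnt/blob"))
    sh("sync && echo 3 > /proc/sys/vm/drop_caches")
```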
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 06/28/2018 09:42 AM, Remi Gauvin wrote:
> There seems to be a major design flaw with BTRFS that needs to be better documented, to avoid massive data loss. Tested with Raid 1 on Ubuntu, kernel 4.15.
>
> The use case being tested was a Virtualbox VDI file created with the NODATACOW attribute (as is often suggested, due to the painful performance penalty of COW on these files).
>
> However, if a device is temporarily dropped (in this case, tested by disconnecting drives) and re-connects automatically next boot, BTRFS does not in any way synchronize the VDI file, or have any means to know that one of the copies is out of date and bad. The result of trying to use said VDI file is interestingly insane.
>
> Scrub did not do anything to rectify the situation.

Please use Balance to rectify, as it's RAID1. Because when one of the devices was missing we wrote Single chunks.

Thanks, Anand
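[Editorial illustration] To spell that out as a sketch (the mount point is an assumption; run it against your own volume): after the volume is healthy again, a convert balance with the "soft" filter rewrites only the chunks that are not already raid1, which is what picks up the single chunks written while the device was gone.

```python
#!/usr/bin/env python3
# Post-reconnect cleanup sketch: chunks allocated while a member was
# missing come out as "single", and a plain scrub won't re-replicate
# them. The convert filters do; "soft" skips chunks already at the
# target profile, so only the stray single chunks are rewritten.
import subprocess

MNT = "/mnt/data"   # the raid1 volume (assumption)

# Show current per-profile usage -- look for "single" chunk groups.
subprocess.run(["btrfs", "filesystem", "usage", MNT], check=True)

# Convert any single chunks back to raid1 without touching the rest.
subprocess.run(["btrfs", "balance", "start",
                "-dconvert=raid1,soft",
                "-mconvert=raid1,soft", MNT], check=True)
```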
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 07:46, Qu Wenruo wrote:
> [...]
>
> Well, a design flaw should be something that can't be easily fixed without *huge* on-disk format or behavior change. A flaw in btrfs' one-subvolume-per-tree metadata design or the current extent bookkeeping behavior could be called a design flaw.

That would be a structural design flaw; it's a result of how the software is structured. There are other types of design flaws though.

> While for things like this, just as the submitted RFC patch, less than 100 lines could change the behavior.

I would still consider this case a design flaw (a purely behavioral one, not tied to how the software is structured).
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 19:12, Austin S. Hemmelgarn wrote:
> On 2018-06-28 05:15, Qu Wenruo wrote:
>> On 2018-06-28 16:16, Andrei Borzenkov wrote:
>>> [...]
>>> When one drive fails, it is recorded in meta-data on remaining drives; probably the configuration generation number is increased. Next time the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we could start from detecting and refusing to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, a hardware RAID controller seems to be the only solution as other software may have the same problem.
>
> LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too.
>
>> And the hardware solution looks pretty interesting: is the write to NVRAM 100% atomic? Even at power loss?
>
> On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss.

Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special battery-backed (always atomic) journal device. If we can't provide a UPS for the whole system, a battery-powered journal device indeed makes sense.

>>> Why should it not work as long as any write to the array is suspended until the superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in the device superblocks?
>>
>> If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached the super block update, there is no difference in the device superblocks, thus no way to detect which is correct.
>
> Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this).
>
>>>> If you're talking about the missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the split-brain case).
>>>
>>> That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.
>>
>> While for the VDI case (or any VM image file format other than raw), older valid data normally means corruption, unless they have their own write-ahead log. Some file formats may detect such problems by themselves if they have an internal checksum, but anyway, older data normally means corruption, especially when partially new and partially old.
>>
>> On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for a single device). So the title, especially the "major design flaw", couldn't be more wrong.
>
> The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation ID's are different and preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed without *huge* on-disk format or behavior change. A flaw in btrfs' one-subvolume-per-tree metadata design or the current extent bookkeeping behavior could be called a design flaw.

While for things like this, just as the submitted RFC patch, less than 100 lines could change the behavior.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-28 05:15, Qu Wenruo wrote:
> On 2018-06-28 16:16, Andrei Borzenkov wrote:
>> [...]
>> When one drive fails, it is recorded in meta-data on remaining drives; probably the configuration generation number is increased. Next time the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning of other disks.
>
> Yep, the only possible way to determine such case is from external info.
>
> For device generation, it's possible to enhance btrfs, but at least we could start from detecting and refusing to RW mount to avoid possible further corruption. But anyway, if one really cares about such case, a hardware RAID controller seems to be the only solution as other software may have the same problem.

LVM doesn't. It detects that one of the devices was gone for some period of time and marks the volume as degraded (and _might_, depending on how you have things configured, automatically re-sync). Not sure about MD, but I am willing to bet it properly detects this type of situation too.

> And the hardware solution looks pretty interesting: is the write to NVRAM 100% atomic? Even at power loss?

On a proper RAID controller, it's battery backed, and that battery backing provides enough power to also make sure that the state change is properly recorded in the event of power loss.

>> Why should it not work as long as any write to the array is suspended until the superblock on remaining devices is updated?
>
> What happens if there is no generation gap in the device superblocks?
>
> If one device got some of its (nodatacow) data written to disk, while the other device doesn't get data written, and neither of them reached the super block update, there is no difference in the device superblocks, thus no way to detect which is correct.

Yes, but that should be a very small window (at least, once we finally quit serializing writes across devices), and it's a problem on existing RAID1 implementations too (and therefore isn't something we should be using as an excuse for not doing this).

>>> If you're talking about the missing generation check for btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 (mdraid or LVM mirrored) can also be affected (the split-brain case).
>>
>> That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.
>
> While for the VDI case (or any VM image file format other than raw), older valid data normally means corruption, unless they have their own write-ahead log. Some file formats may detect such problems by themselves if they have an internal checksum, but anyway, older data normally means corruption, especially when partially new and partially old.
>
> On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for a single device). So the title, especially the "major design flaw", couldn't be more wrong.

The title is excessive, but I'd agree it's a design flaw that BTRFS doesn't at least notice that the generation ID's are different and preferentially trust the device with the newer generation ID. The only special handling I can see that would be needed is around volumes mounted with the `nodatacow` option, which may not see generation changes for a very long time otherwise.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 16:16, Andrei Borzenkov wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote:
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>> Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1 (or, for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1. Plain RAID1 has no idea which copy is correct, unlike btrfs, which has datasum.
>>
>> Never mind everything else, just tell me: how do you determine which one is correct?
>
> When one drive fails, it is recorded in metadata on the remaining drives; typically a configuration generation number is increased. Next time, the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning the other disks.

Yep, external info is the only possible way to detect such a case.

For device generation it's possible to enhance btrfs, but at least we could start by detecting the mismatch and refusing to mount read-write, to avoid further corruption.

But anyway, if one really cares about such a case, a hardware RAID controller seems to be the only solution, as other software may have the same problem.

And the hardware solution looks pretty interesting: is the write to NVRAM 100% atomic, even at power loss?

>> The only possibility is that the misbehaving device missed several superblock updates, so we have a chance to detect that it's out of date. But that doesn't always work.
>
> Why should it not work, as long as any write to the array is suspended until the superblock on the remaining devices is updated?

What happens if there is no generation gap in the device superblocks?

If one device got some of its (nodatacow) data written to disk while the other device didn't, and neither of them reached the superblock update, there is no difference in the device superblocks, and thus no way to detect which copy is correct.

>> If you're talking about the missing generation check in btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 implementations (mdraid or LVM mirroring) can also be affected (the split-brain case).
>
> That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.

But for the VDI case (or any VM image format other than raw), older valid data normally means corruption, unless the format has its own write-ahead log.

Some file formats may detect such problems themselves if they have internal checksums, but in general older data means corruption, especially when it is partially new and partially old.

On the other hand, with data COW and csum, btrfs can ensure the whole filesystem update is atomic (at least for a single device). So the title, especially the "major design flaw", couldn't be more wrong.

>>> others will automatically kick out the misbehaving drive. *None* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal, and defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate one missing device; it has nothing to do with error detection. And it's impossible to detect such a case without extra help.
>>
>> Your expectation is completely wrong.
>
> Well ... somehow it is my experience as well ... :)

Understandable, but it doesn't really apply to software-based RAID1.

Thanks,
Qu

>>> I'm not the one who has to clear his expectations here.
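The window Qu describes, a crash after a nodatacow write reached only one mirror but before any superblock commit, can be shown with the same kind of toy model. The dictionaries below are hypothetical stand-ins for two mirror devices; the point is only that a generation-based check necessarily passes while the data has already diverged.

    # Two hypothetical mirror devices, modeled as dicts. A nodatacow
    # write reached dev_a but not dev_b, and the crash happened before
    # either superblock generation was bumped.
    dev_a = {"generation": 1042, "block_7": b"new data"}   # write landed
    dev_b = {"generation": 1042, "block_7": b"old data"}   # write lost

    # A generation-only check (like assemble() above) admits both:
    assert dev_a["generation"] == dev_b["generation"]
    # ...yet the payload has diverged, with no checksum to arbitrate:
    assert dev_a["block_7"] != dev_b["block_7"]
    print("undetectable by generation alone")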
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov wrote:
>> RAID1 can only tolerate one missing device; it has nothing to do with error detection. And it's impossible to detect such a case without extra help.
>>
>> Your expectation is completely wrong.
>
> Well ... somehow it is my experience as well ... :)

s/experience/expectation/ sorry.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo wrote:
> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>> Please get yourself clear of what other raid1 is doing.
>>
>> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1 (or, for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,
>
> WOW, that's black magic, at least for RAID1. Plain RAID1 has no idea which copy is correct, unlike btrfs, which has datasum.
>
> Never mind everything else, just tell me: how do you determine which one is correct?

When one drive fails, it is recorded in metadata on the remaining drives; typically a configuration generation number is increased. Next time, the drive with the older generation is not incorporated. Hardware controllers also keep this information in NVRAM and so do not even depend on scanning the other disks.

> The only possibility is that the misbehaving device missed several superblock updates, so we have a chance to detect that it's out of date. But that doesn't always work.

Why should it not work, as long as any write to the array is suspended until the superblock on the remaining devices is updated?

> If you're talking about the missing generation check in btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 implementations (mdraid or LVM mirroring) can also be affected (the split-brain case).

That's different. Yes, with software-based raid there is usually no way to detect an outdated copy if no other copies are present. Having older valid data is still very different from corrupting newer data.

>> others will automatically kick out the misbehaving drive. *None* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal, and defeats even the most basic expectations of RAID.
>
> RAID1 can only tolerate one missing device; it has nothing to do with error detection. And it's impossible to detect such a case without extra help.
>
> Your expectation is completely wrong.

Well ... somehow it is my experience as well ... :)

>> I'm not the one who has to clear his expectations here.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 11:14, r...@georgianit.com wrote:
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>> Please get yourself clear of what other raid1 is doing.
>
> A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1 (or, for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically,

WOW, that's black magic, at least for RAID1. Plain RAID1 has no idea which copy is correct, unlike btrfs, which has datasum.

Never mind everything else, just tell me: how do you determine which one is correct?

The only possibility is that the misbehaving device missed several superblock updates, so we have a chance to detect that it's out of date. But that doesn't always work.

If you're talking about the missing generation check in btrfs, that's valid, but it's far from a "major design flaw", as there are a lot of cases where other RAID1 implementations (mdraid or LVM mirroring) can also be affected (the split-brain case).

> others will automatically kick out the misbehaving drive. *None* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal, and defeats even the most basic expectations of RAID.

RAID1 can only tolerate one missing device; it has nothing to do with error detection. And it's impossible to detect such a case without extra help.

Your expectation is completely wrong.

> I'm not the one who has to clear his expectations here.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
> Please get yourself clear of what other raid1 is doing.

A drive failure, where the drive is still there when the computer reboots, is a situation that *any* raid 1 (or, for that matter, raid 5, raid 6, anything but raid 0) will recover from perfectly without raising a sweat. Some will rebuild the array automatically; others will automatically kick out the misbehaving drive. *None* of them will take back the drive with old data and start commingling that data with the good copy. This behaviour from BTRFS is completely abnormal, and defeats even the most basic expectations of RAID.

I'm not the one who has to clear his expectations here.
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 10:10, Remi Gauvin wrote:
> On 2018-06-27 09:58 PM, Qu Wenruo wrote:
>> On 2018年06月28日 09:42, Remi Gauvin wrote:
>>> There seems to be a major design flaw with BTRFS that needs to be better documented, to avoid massive data loss.
>>>
>>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>>
>>> The use case being tested was a VirtualBox VDI file created with the NODATACOW attribute (as is often suggested, due to the painful performance penalty of COW on these files).
>>
>> NODATACOW implies NODATASUM.
>
> Yes, yes, none of which changes the simple fact that if you use this option, which is often touted as outright necessary for some types of files, BTRFS raid is worse than useless: not only will it not protect your data at all from bitrot (as expected), it will actively go out of its way to corrupt it!
>
> This is not expected behaviour from 'Raid', and I despair that this seems to be something I have to explain!

Nope, all normal raid1 behaves the same: if one copy gets corrupted, you won't know which one is correct. Btrfs csum is already doing a much better job than plain raid1.

Please get yourself clear of what other raid1 is doing.

Thanks,
Qu
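A rough sketch of what "csum is doing a much better job" means in practice, assuming a simplified model where a checksum recorded at write time arbitrates between mirrors. zlib.crc32 stands in for the crc32c that btrfs actually uses; none of this is the real on-disk format.

    # Sketch: with datasum, the checksum tree is an arbiter between
    # mirrors; with nodatacow (and hence no csum) the two copies are
    # indistinguishable.
    import zlib

    def pick_good_copy(stored_csum, copies):
        """Return (index, data) of the first copy matching the csum."""
        for i, data in enumerate(copies):
            if zlib.crc32(data) == stored_csum:
                return i, data
        return None   # no copy matches: report an unrecoverable error

    good  = b"current contents"
    stale = b"contents from before the device dropped"
    stored = zlib.crc32(good)     # recorded at write time (datacow only)

    print(pick_good_copy(stored, [stale, good]))  # -> (1, b'current contents')
    # For a nodatacow file there is no stored csum at all, so neither
    # scrub nor the read path can tell stale from current -- the plain
    # RAID1 situation Qu describes.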
Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018-06-27 09:58 PM, Qu Wenruo wrote:
> On 2018年06月28日 09:42, Remi Gauvin wrote:
>> There seems to be a major design flaw with BTRFS that needs to be better documented, to avoid massive data loss.
>>
>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>
>> The use case being tested was a VirtualBox VDI file created with the NODATACOW attribute (as is often suggested, due to the painful performance penalty of COW on these files).
>
> NODATACOW implies NODATASUM.

Yes, yes, none of which changes the simple fact that if you use this option, which is often touted as outright necessary for some types of files, BTRFS raid is worse than useless: not only will it not protect your data at all from bitrot (as expected), it will actively go out of its way to corrupt it!

This is not expected behaviour from 'Raid', and I despair that this seems to be something I have to explain!
Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
On 2018年06月28日 09:42, Remi Gauvin wrote:
> There seems to be a major design flaw with BTRFS that needs to be better documented, to avoid massive data loss.
>
> Tested with Raid 1 on Ubuntu Kernel 4.15
>
> The use case being tested was a VirtualBox VDI file created with the NODATACOW attribute (as is often suggested, due to the painful performance penalty of COW on these files).

NODATACOW implies NODATASUM. From btrfs(5):

---
Enable data copy-on-write for newly created files. Nodatacow implies nodatasum, and disables compression. All files created under nodatacow are also set the NOCOW file attribute (see chattr(1)).
---

Although this text describes the mount option, it applies to the per-inode option as well.

Thanks,
Qu

> However, if a device is temporarily dropped (in this case, tested by disconnecting drives) and re-connects automatically on the next boot, BTRFS does not in any way synchronize the VDI file, nor does it have any means of knowing that one of the copies is out of date and bad.
>
> The result of trying to use said VDI file is interestingly insane. Scrub did not do anything to rectify the situation.
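For reference, the per-file NOCOW attribute mentioned in that btrfs(5) excerpt is normally set with chattr(1), and it only takes reliable effect before any data exists, so the usual pattern is to set it on the containing directory and let new files inherit it. A small illustrative sketch; the path is hypothetical, and chattr/lsattr are the standard tools named in the man pages.

    # Minimal sketch: mark a directory NOCOW so that VM images created
    # inside it are nodatacow from their first byte. Assumes the
    # directory lives on a btrfs filesystem and you have permission.
    import pathlib
    import subprocess

    vm_dir = pathlib.Path("/srv/vms")        # hypothetical location
    vm_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["chattr", "+C", str(vm_dir)], check=True)
    # New files created inside vm_dir now inherit the NOCOW attribute.
    subprocess.run(["lsattr", "-d", str(vm_dir)], check=True)  # shows 'C'

    # The trade-off discussed in this thread: no COW also means no
    # checksums, so RAID1 copies of such files can silently diverge
    # after a temporary device drop.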
Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files
There seems to be a major design flaw with BTRFS that needs to be better documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a VirtualBox VDI file created with the NODATACOW attribute (as is often suggested, due to the painful performance penalty of COW on these files).

However, if a device is temporarily dropped (in this case, tested by disconnecting drives) and re-connects automatically on the next boot, BTRFS does not in any way synchronize the VDI file, nor does it have any means of knowing that one of the copies is out of date and bad.

The result of trying to use said VDI file is interestingly insane. Scrub did not do anything to rectify the situation.
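One plausible reading of the "interestingly insane" behaviour: btrfs picks which mirror services a read (at the time, reportedly by PID parity), so different processes can see entirely different versions of the blocks that were never resynced, while freshly written blocks are current on both devices. A toy model with invented block contents, not a description of the actual read path:

    # Toy model: after the dropped device rejoins, new writes land on
    # both mirrors, but blocks written while it was absent still differ.
    # Reads pick a mirror -- modeled here as pid % 2 -- so each process
    # sees one of two histories.
    mirror = [
        {"blk0": "new", "blk1": "old", "blk2": "new-write"},  # rejoined late
        {"blk0": "new", "blk1": "new", "blk2": "new-write"},  # stayed online
    ]

    def read(block, pid):
        return mirror[pid % 2][block]  # simplified mirror selection

    for pid in (1000, 1001):
        view = {b: read(b, pid) for b in ("blk0", "blk1", "blk2")}
        print(pid, view)
    # pid 1000 sees blk1 == "old", pid 1001 sees blk1 == "new": the
    # VDI's contents appear to change underneath the guest, and with
    # no csums a scrub has nothing to compare against, matching the
    # report that scrub did not rectify the situation.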