Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-30 02:33, Duncan wrote:

Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
excerpted:


On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:


And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if
the copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At
least,
not unless BTRFS gets proper generational tracking to handle
temporarily missing devices.  As of right now, sanely fixing things
requires significant manual intervention, as you have to bypass the
device read selection algorithm to be able to look at the state of the
individual copies so that you can pick one to use and forcibly rewrite
the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.



The thing is though, as things stand right now, you need to manually
edit the data on-disk directly or restore the file from a backup to fix
the file.  While it's technically true that you can manually repair this
type of thing, in both of the cases for doing it without those patches I
mentioned it's functionally impossible for a regular user to do it
without potentially losing some data.


[Usual backups rant, user vs. admin variant, nocow/tmpfs edition.
Regulars can skip as the rest is already predicted from past posts, for
them. =;^]

"Regular user"?

"Regular users" don't need to bother with this level of detail.  They
simply get their "admin" to do it, even if that "admin" is their kid, or
the kid from next door that's good with computers, or the geek squad (aka
nsa-agent-squad) guy/gal, doing it... or telling them to install "a real
OS", meaning whatever MS/Apple/Google something that they know how to
deal with.

If the "user" is dealing with setting nocow, choosing btrfs in the first
place, etc, then they're _not_ a "regular user" by definition, they're
already an admin.

I'd argue that that's not always true.  'Regular users' also blindly
follow advice they find online about how to make their system run
better, and quite often don't keep backups.


And as any admin learns rather quickly, the value of data is defined by
the number of backups it's worth having of that data.

Which means it's not a problem.  Either the data had a backup and it's
(reasonably) trivial to restore the data from that backup, or the data
was defined by lack of having that backup as of only trivial value, so
low as to not be worth the time/trouble/resources necessary to make that
backup in the first place.

Which of course means what was defined as of most value, either the data
if there was a backup, or the time/trouble/resources that would have gone
into creating it if not, is *always* saved.

(And of course the same goes for "I had a backup, but it's old", except
in this case it's the value of the data delta between the backup and
current.  As soon as it's worth more than the time/trouble/hassle of
updating the backup, it will by definition be updated.  Not having a
newer backup available thus simply means the value of the data that
changed between the last backup and current was simply not enough to
justify updating the backup, and again, what was of most value is
*always* saved, either the data, or the time that would have otherwise
gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/
to know it, if there's anything an admin knows *well* it's that the
working copy of data **WILL** be damaged.  It's not a matter of if, but
of when, and of whether it'll be a fat-finger mistake, or a hardware or
software failure, or wetware (theft, ransomware, etc), or weather (flood,
fire and the damage from the water that put it out, etc), tho none of that
actually matters after all, because in the end, the only thing that
matters was how the value of that data was defined by the number of
backups made of it, and how quickly and conveniently at 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-30 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
excerpted:

> On 2018-06-29 13:58, james harvey wrote:
>> On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
>>  wrote:
>>> On 2018-06-29 11:15, james harvey wrote:

 On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy wrote:
>
> And an open question I have about scrub is whether it only ever is
> checking csums, meaning nodatacow files are never scrubbed, or if
> the copies are at least compared to each other?


 Scrub never looks at nodatacow files.  It does not compare the copies
 to each other.

 Qu submitted a patch to make check compare the copies:
 https://patchwork.kernel.org/patch/10434509/

 This hasn't been added to btrfs-progs git yet.

 IMO, I think the offline check should look at nodatacow copies like
 this, but I still think this also needs to be added to scrub.  In the
 patch thread, I discuss my reasons why.  In brief: online scanning;
 this goes along with user's expectation of scrub ensuring mirrored
 data integrity; and recommendations to set up scrub on a periodic basis
 to me mean it's the place to put it.
>>>
>>> That said, it can't sanely fix things if there is a mismatch. At
>>> least,
>>> not unless BTRFS gets proper generational tracking to handle
>>> temporarily missing devices.  As of right now, sanely fixing things
>>> requires significant manual intervention, as you have to bypass the
>>> device read selection algorithm to be able to look at the state of the
>>> individual copies so that you can pick one to use and forcibly rewrite
>>> the whole file by hand.
>> 
>> Absolutely.  User would need to use manual intervention as you
>> describe, or restore the single file(s) from backup.  But, it's a good
>> opportunity to tell the user they had partial data corruption, even if
>> it can't be auto-fixed.  Otherwise they get intermittent data
>> corruption, depending on which copies are read.

> The thing is though, as things stand right now, you need to manually
> edit the data on-disk directly or restore the file from a backup to fix
> the file.  While it's technically true that you can manually repair this
> type of thing, in both of the cases for doing it without those patches I
> mentioned it's functionally impossible for a regular user to do it
> without potentially losing some data.

[Usual backups rant, user vs. admin variant, nocow/tmpfs edition.  
Regulars can skip as the rest is already predicted from past posts, for 
them. =;^]

"Regular user"?  

"Regular users" don't need to bother with this level of detail.  They 
simply get their "admin" to do it, even if that "admin" is their kid, or 
the kid from next door that's good with computers, or the geek squad (aka 
nsa-agent-squad) guy/gal, doing it... or telling them to install "a real 
OS", meaning whatever MS/Apple/Google something that they know how to 
deal with.

If the "user" is dealing with setting nocow, choosing btrfs in the first 
place, etc, then they're _not_ a "regular user" by definition, they're 
already an admin.

And as any admin learns rather quickly, the value of data is defined by 
the number of backups it's worth having of that data.

Which means it's not a problem.  Either the data had a backup and it's 
(reasonably) trivial to restore the data from that backup, or the data 
was defined by lack of having that backup as of only trivial value, so 
low as to not be worth the time/trouble/resources necessary to make that 
backup in the first place.

Which of course means what was defined as of most value, either the data 
if there was a backup, or the time/trouble/resources that would have gone 
into creating it if not, is *always* saved.

(And of course the same goes for "I had a backup, but it's old", except 
in this case it's the value of the data delta between the backup and 
current.  As soon as it's worth more than the time/trouble/hassle of 
updating the backup, it will by definition be updated.  Not having a 
newer backup available thus simply means the value of the data that 
changed between the last backup and current was simply not enough to 
justify updating the backup, and again, what was of most value is 
*always* saved, either the data, or the time that would have otherwise 
gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/ 
to know it, if there's anything an admin knows *well* it's that the 
working copy of data **WILL** be damaged.  It's not a matter of if, but 
of when, and of whether it'll be a fat-finger mistake, or a hardware or 
software failure, or wetware (theft, ransomware, etc), or weather (flood, 
fire and the damage from the water that put it out, etc), tho none of that 
actually matters after all, because in the end, the only thing that 
matters was how the value of that data was defined by the number of 
backups made of it, and how quickly and conveniently at least 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Chris Murphy
On Fri, Jun 29, 2018 at 9:15 AM, james harvey  wrote:
> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:
>> And an open question I have about scrub is whether it only ever is
>> checking csums, meaning nodatacow files are never scrubbed, or if the
>> copies are at least compared to each other?
>
> Scrub never looks at nodatacow files.  It does not compare the copies
> to each other.
>
> Qu submitted a patch to make check compare the copies:
> https://patchwork.kernel.org/patch/10434509/

Yeah, online scrub needs to report any mismatches, even if it can't do
anything about it because it's ambiguous which copy is wrong.


> IMO, I think the offline check should look at nodatacow copies like
> this, but I still think this also needs to be added to scrub.  In the
> patch thread, I discuss my reasons why.  In brief: online scanning;
> this goes along with user's expectation of scrub ensuring mirrored
> data integrity; and recommendations to set up scrub on a periodic basis
> to me mean it's the place to put it.

I don't mind this being implemented in offline scrub first for testing
purposes. But the online scrub certainly should have this ability
eventually.



-- 
Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy 
wrote:


And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At least, not
unless BTRFS gets proper generational tracking to handle temporarily missing
devices.  As of right now, sanely fixing things requires significant manual
intervention, as you have to bypass the device read selection algorithm to
be able to look at the state of the individual copies so that you can pick
one to use and forcibly rewrite the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.
The thing is though, as things stand right now, you need to manually 
edit the data on-disk directly or restore the file from a backup to fix 
the file.  While it's technically true that you can manually repair this 
type of thing, in both of the cases for doing it without those patches I 
mentioned it's functionally impossible for a regular user to do it 
without potentially losing some data.


Unless that changes, scrub telling you it's corrupt is not going to help 
much aside from making sure you don't make things worse by trying to use 
it.  Given this, it would make sense to have a (disabled by default) 
option to have scrub repair it by just using the newer or older copy of 
the data.  That would require classic RAID generational tracking though, 
which BTRFS doesn't have yet.



A while back, Anand Jain posted some patches that would let you select a
particular device to direct all reads to via a mount option, but I don't
think they ever got merged.  That would have made manual recovery in cases
like this exponentially easier (mount read-only with one device selected,
copy the file out somewhere, remount read-only with the other device, drop
caches, copy the file out again, compare and reconcile the two copies, then
remount the volume writable and write out the repaired file).
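
(Spelled out, that recovery dance might look roughly like the following;
the read_mirror= mount option is purely hypothetical, standing in for
Anand's unmerged read-selection patches, the paths are placeholders, and
the drop_caches step is there so the second read isn't served from the
page cache:)

  # read the file as seen through the first device (hypothetical option)
  mount -o ro,read_mirror=/dev/sda2 /dev/sda2 /mnt
  cp /mnt/images/vm.vdi /tmp/copy-a.vdi
  umount /mnt
  echo 3 > /proc/sys/vm/drop_caches

  # read the same file as seen through the second device (hypothetical option)
  mount -o ro,read_mirror=/dev/sdb2 /dev/sda2 /mnt
  cp /mnt/images/vm.vdi /tmp/copy-b.vdi
  umount /mnt

  # locate the mismatching ranges, reconcile by hand, then write the result back
  cmp -l /tmp/copy-a.vdi /tmp/copy-b.vdi
  mount /dev/sda2 /mnt
  cp /tmp/repaired.vdi /mnt/images/vm.vdi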




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread james harvey
On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-06-29 11:15, james harvey wrote:
>>
>> On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy 
>> wrote:
>>>
>>> And an open question I have about scrub is whether it only ever is
>>> checking csums, meaning nodatacow files are never scrubbed, or if the
>>> copies are at least compared to each other?
>>
>>
>> Scrub never looks at nodatacow files.  It does not compare the copies
>> to each other.
>>
>> Qu submitted a patch to make check compare the copies:
>> https://patchwork.kernel.org/patch/10434509/
>>
>> This hasn't been added to btrfs-progs git yet.
>>
>> IMO, I think the offline check should look at nodatacow copies like
>> this, but I still think this also needs to be added to scrub.  In the
>> patch thread, I discuss my reasons why.  In brief: online scanning;
>> this goes along with user's expectation of scrub ensuring mirrored
>> data integrity; and recommendations to set up scrub on a periodic basis
>> to me mean it's the place to put it.
>
> That said, it can't sanely fix things if there is a mismatch. At least, not
> unless BTRFS gets proper generational tracking to handle temporarily missing
> devices.  As of right now, sanely fixing things requires significant manual
> intervention, as you have to bypass the device read selection algorithm to
> be able to look at the state of the individual copies so that you can pick
> one to use and forcibly rewrite the whole file by hand.

Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.

> A while back, Anand Jain posted some patches that would let you select a
> particular device to direct all reads to via a mount option, but I don't
> think they ever got merged.  That would have made manual recovery in cases
> like this exponentially easier (mount read-only with one device selected,
> copy the file out somewhere, remount read-only with the other device, drop
> caches, copy the file out again, compare and reconcile the two copies, then
> remount the volume writable and write out the repaired file).


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 11:15, james harvey wrote:

On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:

And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?


Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.
That said, it can't sanely fix things if there is a mismatch.  At least, 
not unless BTRFS gets proper generational tracking to handle temporarily 
missing devices.  As of right now, sanely fixing things requires 
significant manual intervention, as you have to bypass the device read 
selection algorithm to be able to look at the state of the individual 
copies so that you can pick one to use and forcibly rewrite the whole 
file by hand.


A while back, Anand Jain posted some patches that would let you select a 
particular device to direct all reads to via a mount option, but I don't 
think they ever got merged.  That would have made manual recovery in 
cases like this exponentially easier (mount read-only with one device 
selected, copy the file out somewhere, remount read-only with the other 
device, drop caches, copy the file out again, compare and reconcile the 
two copies, then remount the volume writable and write out the repaired 
file).



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread james harvey
On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:
> And an open question I have about scrub is whether it only ever is
> checking csums, meaning nodatacow files are never scrubbed, or if the
> copies are at least compared to each other?

Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to set up scrub on a periodic basis
to me mean it's the place to put it.
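
(For reference, the "periodic basis" recommendation usually amounts to
running something like the following from cron or a systemd timer, say
weekly; the mount point is a placeholder:)

  # -B waits for completion so errors end up in the job's output,
  # -d prints per-device statistics
  btrfs scrub start -Bd /data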


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月29日 01:10, Andrei Borzenkov wrote:
> 28.06.2018 12:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:


 On 2018年06月28日 11:14, r...@georgianit.com wrote:
>
>
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>
>>
>> Please get yourself clear of what other raid1 is doing.
>
> A drive failure, where the drive is still there when the computer 
> reboots, is a situation that *any* raid 1, (or for that matter, raid 5, 
> raid 6, anything but raid 0) will recover from perfectly without raising 
> a sweat. Some will rebuild the array automatically,

 WOW, that's black magic, at least for RAID1.
 The whole RAID1 has no idea of which copy is correct unlike btrfs who
 has datasum.

 Don't bother other things, just tell me how to determine which one is
 correct?

>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detect and refuse to RW mount to avoid possible further
>> corruption.
>> But anyway, if one really cares about such case, hardware RAID
>> controller seems to be the only solution as other software may have the
>> same problem.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
>>
>>>
 The only possibility is that, the misbehaved device missed several super
 block update so we have a chance to detect it's out-of-date.
 But that's not always working.

>>>
>>> Why it should not work as long as any write to array is suspended
>>> until superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
> 
> Well, you use "generation" in strict btrfs sense, I use "generation"
> generically. That is exactly what btrfs apparently lacks currently -
> some monotonic counter that is used to record such event.

Indeed, btrfs doesn't have any way to record which device got degraded
at all.
The use of the btrfs device generation is already a kind of workaround.

So to keep the same behavior as mdraid/lvm, each time btrfs detects a
missing device or a fatal command (flush/fua) not executed correctly, btrfs
needs to record it, maybe in its device item, and commit it to disk.

In short, btrfs csum makes us a little overconfident about such
device-missing cases: normally csum will tell us which data is wrong, so
we can avoid complex device status tracking.
But apparently, if nodatasum is involved, everything just falls outside
our expectations.

> 
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> super block update, there is no difference in device superblock, thus no
>> way to detect which is correct.
>>
> 
> Again, the very fact that device failed should have triggered update of
> superblock to record this information which presumably should increase
> some counter.

Indeed.

> 
>>>
 If you're talking about missing generation check for btrfs, that's
 valid, but it's far from a "major design flaw", as there are a lot of
 cases where other RAID1 (mdraid or LVM mirrored) can also be affected
 (the brain-split case).

>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption.
>> Unless they have their own write-ahead log.
>> Some file format may detect such problem by themselves if they have
>> internal checksum, but anyway, older data normally means corruption,
>> especially when partial new and partial old.
>>
> 
> Yes, that's true. But there is really nothing that can be done here,
even theoretically; it's hardly a reason not to do what looks possible.

Well, theoretically, you can just use datasum and datacow :)

Thanks,
Qu

> 
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw", couldn't be more wrong.
>>
>>>
> others will automatically kick out the misbehaving drive.  *none* of them 
> will take back the drive with old data and start commingling that 
> data with the good copy.  This behaviour from BTRFS is completely abnormal, 
> and 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
On Thu, Jun 28, 2018 at 11:37 AM, Goffredo Baroncelli
 wrote:

> Regarding your point 3), it must be pointed out that in the case of NOCOW files, 
> even having the same transid is not enough. It is still possible that a 
> copy is updated before a power failure prevents the super-block update.
> I think that the only way to prevent this from happening is:
>   1) using a data journal (which means that each piece of data is copied twice)
> OR
>   2) using a cow filesystem (with cow enabled of course !)


There is no power failure in this example. So it's really off the
table considering whether Btrfs or mdadm/lvm raid do better in the
same situation with a nodatacow file.

I think here is the problem in the Btrfs nodatacow case. Btrfs doesn't
have a way of untrusting nodatacow files on a previously missing drive
that hasn't been balanced. There is no such thing as nometadatacow, so
no matter what, it figures out there's a problem and uses the good
copy of metadata, but it never "marks" the previously missing device
as suspicious. When it comes time to read a nodatacow file, Btrfs just
blindly reads off one of the drives; it has no mechanism for
questioning the formerly missing drive without csum.

That is actually a really weird and unique kind of write hole for
Btrfs raid1 when the data is nodatacow.

I have to agree with Remi. This is a flaw in the design or a bad bug,
however you want to consider it. Because mdadm/lvm do not behave this
way in the exact same situation.

And an open question I have about scrub is whether it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?

As for fixes:

- At mount time, Btrfs sees from the supers that there is a transid
mismatch, and should not read nodatacow files from the lower-transid device
until an auto balance has completed. Right now Btrfs doesn't have an
abbreviated balance that "replays" the events between two transids.
Basically it would work like send/receive but for balance to catch up
a previously missing device. Right now we have to do a full balance
which is a brutal penalty for a briefly missing drive. Again, mdadm
and lvm do better here by default.

- Fix the performance issues of COW with disk images. ZFS doesn't even
have a nodatacow option and they're running VM images on ZFS and it
doesn't sound like they're running into ridiculous performance
penalties that make it impractical to use.



-- 
Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
On Thu, Jun 28, 2018 at 9:37 AM, Remi Gauvin  wrote:
> On 2018-06-28 10:17 AM, Chris Murphy wrote:
>
>> 2. The new data goes in a single chunk; even if the user does a manual
>> balance (resync) their data isn't replicated. They must know to do a
>> -dconvert balance to replicate the new data. Again this is a net worse
>> behavior than mdadm out of the box, putting user data at risk.
>
> I'm not sure this is the case.  Even though writes failed to the
> disconnected device, btrfs seemed to keep on going as though it *were*.

Yeah in your case the failure happens during normal operation and in
that case there's no degraded state on Btrfs. So it keeps writing to
raid1 chunk on the working drive, with writes on the failed devices
going nowhere (with lots of write errors). When you stop using the
volume, fix the problem with the missing drive, then remount the
volume, it really should still use only the new copy on the never
missing drive, even though it won't necessarily notice the file is
missing on the formerly missing drive. You have to balance manually to
fix it.


> When the array was re-mounted with both devices, (never mounted as
> degraded), and scrub was run, scrub took a *long* time fixing errors, at
> a whopping 3MB/s, and reported having fixed millions of them.

That's slow but it's expected to fix a lot of problems. Even in a very
short amount of time there are thousands of missing data and metadata
extents that need to be replicated.




-- 
Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin

> Acceptable, but not really apply to software based RAID1.
> 

Which completely disregards the minor detail that all the software
RAIDs I know of can handle exactly this kind of situation without
losing or corrupting a single byte of data (errors on the remaining
hard drive notwithstanding).

Exactly what methods they employ to do so I'm not an expert at, but it
*does* work, contrary to your repeated assertions otherwise.

In any case, thank you for the patch you wrote.  I will, however,
propose a different solution.

Given the reliance of BTRFS on csum, and the lack of any
resynchronization (no matter how the drives got out of sync), I think
NoDataCow should just be ignored in the case of RAID, just like the data
blocks would get copied if there was a snapshot.

In the current implementation of RAID on btrfs, RAID and nodatacow are
effectively mutually exclusive.  Consider the kinds of use cases
nodatacow is usually recommended for:  VM images and databases.   Even
though those files should have their own mechanisms for dealing with
incomplete writes, and data verification, BTRFS RAID creates a unique
situation where parts of the file can be inconsistent, with different
data being read depending on which device is doing the reading.
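
(For context on how such files end up nodatacow in the first place: the
attribute is normally set with chattr +C on an empty file, or on the
directory before the images or databases are created, since it has no
effect on files that already contain data; paths below are placeholders.)

  mkdir /data/vm-images
  chattr +C /data/vm-images        # new files created inside inherit +C
  lsattr -d /data/vm-images        # verify: shows 'C' among the attributes

  # or per file, but only while the file is still empty:
  touch /data/vm-images/disk.vdi
  chattr +C /data/vm-images/disk.vdi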

Regardless of which method, short term and long term, developers choose
to address this, I have to stress that I consider this next part very important.

The status page really needs to be updated to reflect this gotcha.  It
*will* bite people in ways they do not expect, and disastrously.



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Goffredo Baroncelli
On 06/28/2018 04:17 PM, Chris Murphy wrote:
> Btrfs does two, maybe three, bad things:
> 1. No automatic resync. This is a net worse behavior than mdadm and
> lvm, putting data at risk.
> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.
> 3. Apparently if nodatacow, given a file with two copies of different
> transid, Btrfs won't always pick the higher transid copy? If true
> that's terrible, and again not at all what mdadm/lvm are doing.

All these could be avoided simply by not allowing a multidevice filesystem to 
mount without ensuring that all the devices have the same generation.

In the past I proposed a mount.btrfs helper; I am still thinking that it would 
be the right place to
a) put all the checks before mounting the filesystem
b) print the correct information in order to help the user understand what he 
has to do to solve the issues
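
(A minimal sketch of the kind of check such a helper could perform, using
only existing btrfs-progs commands: compare the superblock generation
across all devices of the filesystem and refuse to mount on a mismatch.
The device list and mount point are placeholders, and a real helper would
also need to handle a genuinely missing device.)

  #!/bin/sh
  devices="/dev/sda2 /dev/sdb2"
  first_gen=""
  for dev in $devices; do
      # each device has its own superblock copy; grab its generation field
      gen=$(btrfs inspect-internal dump-super "$dev" | awk '/^generation/ {print $2}')
      [ -z "$first_gen" ] && first_gen="$gen"
      if [ "$gen" != "$first_gen" ]; then
          echo "superblock generation mismatch on $dev ($gen vs $first_gen), refusing to mount" >&2
          exit 1
      fi
  done
  exec mount -t btrfs "${devices%% *}" /mnt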

Regarding your point 3), it must be pointed out that in the case of NOCOW files, even 
having the same transid is not enough. It is still possible that a copy is 
updated before a power failure prevents the super-block update.
I think that the only way to prevent this from happening is:
  1) using a data journal (which means that each piece of data is copied twice)
OR
  2) using a cow filesystem (with cow enabled of course !)

I think that this is a good example of why a battery-backed HW RAID controller 
could be better than a SW raid. Of course the likelihood of a lot of these 
problems could be reduced by using an uninterruptible power supply.


BR
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
28.06.2018 12:15, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年06月28日 11:14, r...@georgianit.com wrote:


 On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:

>
> Please get yourself clear of what other raid1 is doing.

 A drive failure, where the drive is still there when the computer reboots, 
 is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
 anything but raid 0) will recover from perfectly without raising a sweat. 
 Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>> has datasum.
>>>
>>> Don't bother other things, just tell me how to determine which one is
>>> correct?
>>>
>>
>> When one drive fails, it is recorded in meta-data on remaining drives;
>> probably configuration generation number is increased. Next time drive
>> with older generation is not incorporated. Hardware controllers also
>> keep this information in NVRAM and so do not even depend on scanning
>> of other disks.
> 
> Yep, the only possible way to determine such case is from external info.
> 
> For device generation, it's possible to enhance btrfs, but at least we
> could start from detect and refuse to RW mount to avoid possible further
> corruption.
> But anyway, if one really cares about such case, hardware RAID
> controller seems to be the only solution as other software may have the
> same problem.
> 
> And the hardware solution looks pretty interesting, is the write to
> NVRAM 100% atomic? Even at power loss?
> 
>>
>>> The only possibility is that, the misbehaved device missed several super
>>> block update so we have a chance to detect it's out-of-date.
>>> But that's not always working.
>>>
>>
>> Why it should not work as long as any write to array is suspended
>> until superblock on remaining devices is updated?
> 
> What happens if there is no generation gap in device superblock?
> 

Well, you use "generation" in the strict btrfs sense, I use "generation"
generically. That is exactly what btrfs apparently lacks currently -
some monotonic counter that is used to record such event.

> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> super block update, there is no difference in device superblock, thus no
> way to detect which is correct.
> 

Again, the very fact that a device failed should have triggered an update of
the superblock to record this information, which presumably should increase
some counter.

>>
>>> If you're talking about missing generation check for btrfs, that's
>>> valid, but it's far from a "major design flaw", as there are a lot of
>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>> (the brain-split case).
>>>
>>
>> That's different. Yes, with software-based raid there is usually no
>> way to detect outdated copy if no other copies are present. Having
>> older valid data is still very different from corrupting newer data.
> 
> While for VDI case (or any VM image file format other than raw), older
> valid data normally means corruption.
> Unless they have their own write-ahead log.
> Some file format may detect such problem by themselves if they have
> internal checksum, but anyway, older data normally means corruption,
> especially when partial new and partial old.
>

Yes, that's true. But there is really nothing that can be done here,
even theoretically; it's hardly a reason not to do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for single device).
> So the title, especially the "major design flaw", couldn't be more wrong.
> 
>>
 others will automatically kick out the misbehaving drive.  *none* of them 
 will take back the drive with old data and start commingling that data 
 with the good copy.  This behaviour from BTRFS is completely abnormal, and 
 defeats even the most basic expectations of RAID.
>>>
>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>> error detection.
>>> And it's impossible to detect such case without extra help.
>>>
>>> Your expectation is completely wrong.
>>>
>>
>> Well ... somehow it is my experience as well ... :)
> 
> Acceptable, but not really apply to software based RAID1.
> 
> Thanks,
> Qu
> 
>>

 I'm not the one who has to clear his expectations here.


>>>
> 






Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Remi Gauvin
On 2018-06-28 10:17 AM, Chris Murphy wrote:

> 2. The new data goes in a single chunk; even if the user does a manual
> balance (resync) their data isn't replicated. They must know to do a
> -dconvert balance to replicate the new data. Again this is a net worse
> behavior than mdadm out of the box, putting user data at risk.

I'm not sure this is the case.  Even though writes failed to the
disconnected device, btrfs seemed to keep on going as though it *were* still connected.

When the array was re-mounted with both devices, (never mounted as
degraded), and scrub was run, scrub took a *long* time fixing errors, at
a whopping 3MB/s, and reported having fixed millions of them.



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Chris Murphy
The problems are known with Btrfs raid1, but I think they bear
repeating because they are really not OK.

In the exact same described scenario: a simple clear-cut drop-off of a
member device, which then later clearly reappears (no transient
failure).

Both mdadm and LVM based raid1 would have re-added the missing device
and resynced it because internal bitmap is the default (on > 100G
arrays for mdadm and always with lvm). Only the new data would be
propagated to user space. Both mdadm and lvm have metadata to know
which drive has stale data in this common scenario.

Btrfs does two, maybe three, bad things:
1. No automatic resync. This is a net worse behavior than mdadm and
lvm, putting data at risk.
2. The new data goes in a single chunk; even if the user does a manual
balance (resync) their data isn't replicated. They must know to do a
-dconvert balance to replicate the new data. Again this is a net worse
behavior than mdadm out of the box, putting user data at risk.
3. Apparently if nodatacow, given a file with two copies of different
transid, Btrfs won't always pick the higher transid copy? If true
that's terrible, and again not at all what mdadm/lvm are doing.


Btrfs can do better because it has more information available to make
unambiguous decisions about data. But it needs to always do at least
as good a job as mdadm/lvm and, as reported, that didn't happen. So
some testing is needed, in particular for case #3 above with nodatacow.
That's a huge bug, if it's true.
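
(A rough way to test case #3 on scratch loop devices might be the
following; sizes and paths are placeholders, and which copy a read
returns may depend on which process and device the read gets routed to:)

  truncate -s 4G /tmp/img0 /tmp/img1
  losetup /dev/loop0 /tmp/img0
  losetup /dev/loop1 /tmp/img1
  mkfs.btrfs -d raid1 -m raid1 /dev/loop0 /dev/loop1
  mount /dev/loop0 /mnt
  touch /mnt/test.bin && chattr +C /mnt/test.bin           # nodatacow file
  dd if=/dev/urandom of=/mnt/test.bin bs=1M count=64
  umount /mnt

  # simulate the temporarily missing device: mount degraded, overwrite, remount with both
  losetup -d /dev/loop1
  mount -o degraded /dev/loop0 /mnt
  dd if=/dev/urandom of=/mnt/test.bin bs=1M count=64 conv=notrunc
  umount /mnt
  losetup /dev/loop1 /tmp/img1
  btrfs device scan
  mount /dev/loop0 /mnt

  # read the file from several different processes and compare the checksums
  md5sum /mnt/test.bin; md5sum /mnt/test.bin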


Chris Murphy


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Anand Jain




On 06/28/2018 09:42 AM, Remi Gauvin wrote:

There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a Virtualbox VDI file created with
NODATACOW attribute, (as is often suggested, due to the painful
performance penalty of COW on these files.)

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically next boot, BTRFS
does not in any way synchronize the VDI file, or have any means to know
that one of copy is out of date and bad.

The result of trying to use said VDI file is interestingly insane.




Scrub did not do anything to rectify the situation.


 Please use Balance to rectify it, as it's RAID1.  Because when one of the
 devices was missing, we wrote Single chunks.

Thanks, Anand
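
(Concretely, that means a conversion balance along these lines; the mount
point is a placeholder, and the "soft" filter limits the work to chunks
that are not already raid1:)

  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt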


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 07:46, Qu Wenruo wrote:



On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo 
wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer
reboots, is a situation that *any* raid 1, (or for that matter,
raid 5, raid 6, anything but raid 0) will recover from perfectly
without raising a sweat. Some will rebuild the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.

LVM doesn't.  It detects that one of the devices was gone for some
period of time and marks the volume as degraded (and _might_, depending
on how you have things configured, automatically re-sync).  Not sure
about MD, but I am willing to bet it properly detects this type of
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?

On a proper RAID controller, it's battery backed, and that battery
backing provides enough power to also make sure that the state change is
properly recorded in the event of power loss.


Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special battery-backed
(always atomic) journal device.
If we can't provide a UPS for the whole system, a battery-powered journal
device indeed makes sense.






The only possibility is that, the misbehaved device missed several
super
block update so we have a chance to detect it's out-of-date.
But that's not always working.



Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.

Yes, but that should be a very small window (at least, once we finally
quit serializing writes across devices), and it's a problem on existing
RAID1 implementations too (and therefore isn't something we should be
using as an excuse for not doing this).





If you're talking about missing generation check for btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the brain-split case).



That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw", couldn't be more wrong.

The title is excessive, but I'd agree it's a design flaw that BTRFS
doesn't at least notice that the generation IDs are different and
preferentially trust the device with the newer generation ID.


Well, a design flaw should be something that can't be easily fixed
without *huge* on-disk format or behavior change.
A flaw in btrfs' one-subvolume-per-tree metadata design or the current extent
booking behavior could be called a design flaw.
That would be a structural design flaw.  It's a result of how the 
software is structured.  There are other types of design flaws though.


While for things like this, as with the submitted RFC patch, less than
100 lines could change the behavior.
I would still consider this case a design flaw (a purely behavioral one 
not tied to how 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:
> On 2018-06-28 05:15, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo 
>>> wrote:


 On 2018年06月28日 11:14, r...@georgianit.com wrote:
>
>
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>
>>
>> Please get yourself clear of what other raid1 is doing.
>
> A drive failure, where the drive is still there when the computer
> reboots, is a situation that *any* raid 1, (or for that matter,
> raid 5, raid 6, anything but raid 0) will recover from perfectly
> without raising a sweat. Some will rebuild the array automatically,

 WOW, that's black magic, at least for RAID1.
 The whole RAID1 has no idea of which copy is correct unlike btrfs who
 has datasum.

 Don't bother other things, just tell me how to determine which one is
 correct?

>>>
>>> When one drive fails, it is recorded in meta-data on remaining drives;
>>> probably configuration generation number is increased. Next time drive
>>> with older generation is not incorporated. Hardware controllers also
>>> keep this information in NVRAM and so do not even depend on scanning
>>> of other disks.
>>
>> Yep, the only possible way to determine such case is from external info.
>>
>> For device generation, it's possible to enhance btrfs, but at least we
>> could start from detect and refuse to RW mount to avoid possible further
>> corruption.
>> But anyway, if one really cares about such case, hardware RAID
>> controller seems to be the only solution as other software may have the
>> same problem.
> LVM doesn't.  It detects that one of the devices was gone for some
> period of time and marks the volume as degraded (and _might_, depending
> on how you have things configured, automatically re-sync).  Not sure
> about MD, but I am willing to bet it properly detects this type of
> situation too.
>>
>> And the hardware solution looks pretty interesting, is the write to
>> NVRAM 100% atomic? Even at power loss?
> On a proper RAID controller, it's battery backed, and that battery
> backing provides enough power to also make sure that the state change is
> properly recorded in the event of power loss.

Well, that explains a lot of things.

So a hardware RAID controller can be considered as having a special battery-backed
(always atomic) journal device.
If we can't provide a UPS for the whole system, a battery-powered journal
device indeed makes sense.

>>
>>>
 The only possibility is that, the misbehaved device missed several
 super
 block update so we have a chance to detect it's out-of-date.
 But that's not always working.

>>>
>>> Why it should not work as long as any write to array is suspended
>>> until superblock on remaining devices is updated?
>>
>> What happens if there is no generation gap in device superblock?
>>
>> If one device got some of its (nodatacow) data written to disk, while
>> the other device doesn't get data written, and neither of them reached
>> super block update, there is no difference in device superblock, thus no
>> way to detect which is correct.
> Yes, but that should be a very small window (at least, once we finally
> quit serializing writes across devices), and it's a problem on existing
> RAID1 implementations too (and therefore isn't something we should be
> using as an excuse for not doing this).
>>
>>>
 If you're talking about missing generation check for btrfs, that's
 valid, but it's far from a "major design flaw", as there are a lot of
 cases where other RAID1 (mdraid or LVM mirrored) can also be affected
 (the brain-split case).

>>>
>>> That's different. Yes, with software-based raid there is usually no
>>> way to detect outdated copy if no other copies are present. Having
>>> older valid data is still very different from corrupting newer data.
>>
>> While for VDI case (or any VM image file format other than raw), older
>> valid data normally means corruption.
>> Unless they have their own write-ahead log.
>>
>> Some file format may detect such problem by themselves if they have
>> internal checksum, but anyway, older data normally means corruption,
>> especially when partial new and partial old.
>>
>> On the other hand, with data COW and csum, btrfs can ensure the whole
>> filesystem update is atomic (at least for single device).
>> So the title, especially the "major design flaw", couldn't be more wrong.
> The title is excessive, but I'd agree it's a design flaw that BTRFS
> doesn't at least notice that the generation ID's are different and
> preferentially trust the device with the newer generation ID.

Well, a design flaw should be something that can't be easily fixed
without *huge* on-disk format or behavior change.
A flaw in btrfs' one-subvolume-per-tree metadata design or the current extent
booking behavior could be called a design flaw.

While for things like this, just as the submitted RFC 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer reboots, is a 
situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but 
raid 0) will recover from perfectly without raising a sweat. Some will rebuild 
the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.
LVM doesn't.  It detects that one of the devices was gone for some 
period of time and marks the volume as degraded (and _might_, depending 
on how you have things configured, automatically re-sync).  Not sure 
about MD, but I am willing to bet it properly detects this type of 
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?
On a proper RAID controller, it's battery backed, and that battery 
backing provides enough power to also make sure that the state change is 
properly recorded in the event of power loss.





The only possibility is that, the misbehaved device missed several super
block update so we have a chance to detect it's out-of-date.
But that's not always working.



Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.
Yes, but that should be a very small window (at least, once we finally 
quit serializing writes across devices), and it's a problem on existing 
RAID1 implementations too (and therefore isn't something we should be 
using as an excuse for not doing this).





If you're talking about the missing generation check in btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID 1 implementations (mdraid or mirrored LVM) can
also be affected (the split-brain case).



That's different. Yes, with software-based RAID there is usually no
way to detect an outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


But for the VDI case (or any VM image format other than raw), older
valid data normally means corruption, unless the format keeps its own
write-ahead log.

Some file formats may detect such problems themselves if they carry
internal checksums, but in general older data means corruption,
especially when it is partly new and partly old.

On the other hand, with data COW and csums, btrfs can ensure the whole
filesystem update is atomic (at least for a single device).
So the title, especially the "major design flaw" part, could not be more
wrong.
The title is excessive, but I'd agree it's a design flaw that BTRFS
doesn't at least notice that the generation IDs are different and
preferentially trust the device with the newer generation ID. The only
special handling I can see that would be needed is around volumes
mounted with the `nodatacow` option, which may not see generation
changes for a very long time otherwise.
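
A rough sketch of that read-side policy in C (the structure and function names
are invented, and this is not how btrfs's actual mirror-selection code is
written), just to illustrate the idea: when the per-device superblock
generations differ, reads of extents that carry no csum would be served only
from the device that saw the most recent commit, until the stale device has
been resynced.
---
#include <stdint.h>
#include <stdio.h>

/* Invented stand-ins for state btrfs already records in each
 * device's superblock. */
struct mirror_dev {
    int      devid;
    uint64_t generation;   /* last committed superblock generation */
};

/* For csummed data either copy can be verified after the read, so the
 * usual load balancing applies.  For nodatacow extents the committed
 * generation is the only hint, so prefer the device with the newer one. */
static int pick_nocsum_mirror(const struct mirror_dev *a,
                              const struct mirror_dev *b)
{
    if (a->generation == b->generation)
        return a->devid;                 /* indistinguishable copies */
    return (a->generation > b->generation) ? a->devid : b->devid;
}

int main(void)
{
    struct mirror_dev d1 = { 1, 5000 };  /* stayed present             */
    struct mirror_dev d2 = { 2, 4973 };  /* temporarily dropped device */

    printf("serve nodatacow reads from devid %d\n",
           pick_nocsum_mirror(&d1, &d2));
    return 0;
}
---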





others will automatically kick out the misbehaving drive.  *None* of them will
take back the drive with old data and start commingling that data with the good
copy. This behaviour from BTRFS is completely abnormal, and defeats even the
most basic expectations of RAID.


RAID 1 can only tolerate one missing device; it has nothing to do with
error detection.
And it's impossible to detect such a case without extra help.

Your expectation is completely wrong.



Well ... somehow it is my experience as well ... :)


Fair enough, but it does not really apply to software-based RAID 1.

Thanks,
Qu





I'm not the one who has to clear his expectations here.


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Qu Wenruo


On 2018年06月28日 16:16, Andrei Borzenkov wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>

 Please get clear on what other RAID 1 implementations actually do.
>>>
>>> A drive failure, where the drive is still there when the computer reboots,
>>> is a situation that *any* RAID 1 (or, for that matter, RAID 5, RAID 6,
>>> anything but RAID 0) will recover from perfectly without breaking a sweat.
>>> Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID 1.
>> Plain RAID 1 has no idea which copy is correct, unlike btrfs, which
>> has data checksums.
>>
>> Never mind everything else, just tell me: how do you determine which
>> copy is correct?
>>
>
> When one drive fails, that is recorded in the metadata on the remaining
> drives; the configuration generation number is probably increased. Next
> time, the drive with the older generation is not incorporated. Hardware
> controllers also keep this information in NVRAM and so do not even
> depend on scanning the other disks.

Yep, the only possible way to determine such a case is from external info.

For device generation, it's possible to enhance btrfs, but at the very
least we could start by detecting the mismatch and refusing to mount
read-write, to avoid possible further corruption.
But anyway, if one really cares about such cases, a hardware RAID
controller seems to be the only solution, as other software may have the
same problem.

And the hardware solution looks pretty interesting: is the write to
NVRAM 100% atomic, even at power loss?

> 
>> The only possibility is that the misbehaving device missed several
>> superblock updates, so we have a chance to detect that it is out of date.
>> But that does not always work.
>>
>
> Why should it not work, as long as any write to the array is suspended
> until the superblock on the remaining devices has been updated?

What happens if there is no generation gap in the device superblocks?

If one device got some of its (nodatacow) data written to disk while
the other device did not, and neither of them reached the next
superblock update, there is no difference between the device
superblocks and thus no way to detect which copy is correct.

> 
>> If you're talking about the missing generation check in btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID 1 implementations (mdraid or mirrored LVM) can
>> also be affected (the split-brain case).
>>
>
> That's different. Yes, with software-based RAID there is usually no
> way to detect an outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.

But for the VDI case (or any VM image format other than raw), older
valid data normally means corruption, unless the format keeps its own
write-ahead log.

Some file formats may detect such problems themselves if they carry
internal checksums, but in general older data means corruption,
especially when it is partly new and partly old.

On the other hand, with data COW and csums, btrfs can ensure the whole
filesystem update is atomic (at least for a single device).
So the title, especially the "major design flaw" part, could not be more
wrong.

> 
>>> others will automatically kick out the misbehaving drive.  *None* of them
>>> will take back the drive with old data and start commingling that data
>>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>>> defeats even the most basic expectations of RAID.
>>
>> RAID 1 can only tolerate one missing device; it has nothing to do with
>> error detection.
>> And it's impossible to detect such a case without extra help.
>>
>> Your expectation is completely wrong.
>>
> 
> Well ... somehow it is my experience as well ... :)

Fair enough, but it does not really apply to software-based RAID 1.

Thanks,
Qu

> 
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>





Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov  wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>

 Please get clear on what other RAID 1 implementations actually do.
>>>
>>> A drive failure, where the drive is still there when the computer reboots,
>>> is a situation that *any* RAID 1 (or, for that matter, RAID 5, RAID 6,
>>> anything but RAID 0) will recover from perfectly without breaking a sweat.
>>> Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID 1.
>> Plain RAID 1 has no idea which copy is correct, unlike btrfs, which
>> has data checksums.
>>
>> Never mind everything else, just tell me: how do you determine which
>> copy is correct?
>>
>
> When one drive fails, that is recorded in the metadata on the remaining
> drives; the configuration generation number is probably increased. Next
> time, the drive with the older generation is not incorporated. Hardware
> controllers also keep this information in NVRAM and so do not even
> depend on scanning the other disks.
>
>> The only possibility is that the misbehaving device missed several
>> superblock updates, so we have a chance to detect that it is out of date.
>> But that does not always work.
>>
>
> Why should it not work, as long as any write to the array is suspended
> until the superblock on the remaining devices has been updated?
>
>> If you're talking about the missing generation check in btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID 1 implementations (mdraid or mirrored LVM) can
>> also be affected (the split-brain case).
>>
>
> That's different. Yes, with software-based RAID there is usually no
> way to detect an outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.
>
>>> others will automatically kick out the misbehaving drive.  *None* of them
>>> will take back the drive with old data and start commingling that data
>>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>>> defeats even the most basic expectations of RAID.
>>
>> RAID 1 can only tolerate one missing device; it has nothing to do with
>> error detection.
>> And it's impossible to detect such a case without extra help.
>>
>> Your expectation is completely wrong.
>>
>
> Well ... somehow it is my experience as well ... :)

s/experience/expectation/

sorry.

>
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>
>
> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>
>>
>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>
>>>
>>> Please get clear on what other RAID 1 implementations actually do.
>>
>> A drive failure, where the drive is still there when the computer reboots,
>> is a situation that *any* RAID 1 (or, for that matter, RAID 5, RAID 6,
>> anything but RAID 0) will recover from perfectly without breaking a sweat.
>> Some will rebuild the array automatically,
>
> WOW, that's black magic, at least for RAID 1.
> Plain RAID 1 has no idea which copy is correct, unlike btrfs, which
> has data checksums.
>
> Never mind everything else, just tell me: how do you determine which
> copy is correct?
>

When one drive fails, that is recorded in the metadata on the remaining
drives; the configuration generation number is probably increased. Next
time, the drive with the older generation is not incorporated. Hardware
controllers also keep this information in NVRAM and so do not even
depend on scanning the other disks.

> The only possibility is that the misbehaving device missed several
> superblock updates, so we have a chance to detect that it is out of date.
> But that does not always work.
>

Why should it not work, as long as any write to the array is suspended
until the superblock on the remaining devices has been updated?

> If you're talking about the missing generation check in btrfs, that's
> valid, but it's far from a "major design flaw", as there are a lot of
> cases where other RAID 1 implementations (mdraid or mirrored LVM) can
> also be affected (the split-brain case).
>

That's different. Yes, with software-based RAID there is usually no
way to detect an outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.

>> others will automatically kick out the misbehaving drive.  *None* of them
>> will take back the drive with old data and start commingling that data
>> with the good copy. This behaviour from BTRFS is completely abnormal, and
>> defeats even the most basic expectations of RAID.
>
> RAID 1 can only tolerate one missing device; it has nothing to do with
> error detection.
> And it's impossible to detect such a case without extra help.
>
> Your expectation is completely wrong.
>

Well ... somehow it is my experience as well ... :)

>>
>> I'm not the one who has to clear his expectations here.
>>
>


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Qu Wenruo


On 2018年06月28日 11:14, r...@georgianit.com wrote:
> 
> 
> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
> 
>>
>> Please get clear on what other RAID 1 implementations actually do.
>
> A drive failure, where the drive is still there when the computer reboots, is
> a situation that *any* RAID 1 (or, for that matter, RAID 5, RAID 6, anything
> but RAID 0) will recover from perfectly without breaking a sweat. Some will
> rebuild the array automatically,

WOW, that's black magic, at least for RAID 1.
Plain RAID 1 has no idea which copy is correct, unlike btrfs, which
has data checksums.

Never mind everything else, just tell me: how do you determine which
copy is correct?

The only possibility is that the misbehaving device missed several
superblock updates, so we have a chance to detect that it is out of date.
But that does not always work.

If you're talking about the missing generation check in btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID 1 implementations (mdraid or mirrored LVM) can
also be affected (the split-brain case).

> others will automatically kick out the misbehaving drive.  *None* of them
> will take back the drive with old data and start commingling that data
> with the good copy. This behaviour from BTRFS is completely abnormal, and
> defeats even the most basic expectations of RAID.

RAID 1 can only tolerate one missing device; it has nothing to do with
error detection.
And it's impossible to detect such a case without extra help.

Your expectation is completely wrong.

> 
> I'm not the one who has to clear his expectations here.
> 
> 





Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread remi



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:

> 
> Please get clear on what other RAID 1 implementations actually do.

A drive failure, where the drive is still there when the computer reboots, is a
situation that *any* RAID 1 (or, for that matter, RAID 5, RAID 6, anything but
RAID 0) will recover from perfectly without breaking a sweat. Some will rebuild
the array automatically, others will automatically kick out the misbehaving
drive.  *None* of them will take back the drive with old data and start
commingling that data with the good copy. This behaviour from BTRFS is
completely abnormal, and defeats even the most basic expectations of RAID.

I'm not the one who has to clear his expectations here.



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Qu Wenruo


On 2018年06月28日 10:10, Remi Gauvin wrote:
> On 2018-06-27 09:58 PM, Qu Wenruo wrote:
>>
>>
>> On 2018年06月28日 09:42, Remi Gauvin wrote:
>>> There seems to be a major design flaw with BTRFS that needs to be better
>>> documented, to avoid massive data loss.
>>>
>>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>>
>>> The use case being tested was a VirtualBox VDI file created with the
>>> NODATACOW attribute (as is often suggested, due to the painful
>>> performance penalty of COW on these files).
>>
>> NODATACOW implies NODATASUM.
>>
> 
> yes yes, none of which changes the simple fact that if you use this
> option, which is often touted as outright necessary for some types of
> files, BTRFS RAID is worse than useless: not only will it not protect
> your data at all from bitrot (as expected), it will actively go out of
> its way to corrupt it!
> 
> This is not expected behaviour from 'RAID', and I despair that this
> seems to be something that I have to explain!

Nope, all normal RAID 1 is the same: if you corrupt one copy, you won't
know which one is correct.
Btrfs csum is already doing a much better job than plain RAID 1.

Please get clear on what other RAID 1 implementations actually do.

Thanks,
Qu





Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Remi Gauvin
On 2018-06-27 09:58 PM, Qu Wenruo wrote:
> 
> 
> On 2018年06月28日 09:42, Remi Gauvin wrote:
>> There seems to be a major design flaw with BTRFS that needs to be better
>> documented, to avoid massive data loss.
>>
>> Tested with Raid 1 on Ubuntu Kernel 4.15
>>
>> The use case being tested was a VirtualBox VDI file created with the
>> NODATACOW attribute (as is often suggested, due to the painful
>> performance penalty of COW on these files).
> 
> NODATACOW implies NODATASUM.
> 

yes yes, none of which changes the simple fact that if you use this
option, which is often touted as outright necessary for some types of
files, BTRFS RAID is worse than useless: not only will it not protect
your data at all from bitrot (as expected), it will actively go out of
its way to corrupt it!

This is not expected behaviour from 'RAID', and I despair that this
seems to be something that I have to explain!






Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Qu Wenruo


On 2018年06月28日 09:42, Remi Gauvin wrote:
> There seems to be a major design flaw with BTRFS that needs to be better
> documented, to avoid massive data loss.
> 
> Tested with Raid 1 on Ubuntu Kernel 4.15
> 
> The use case being tested was a VirtualBox VDI file created with the
> NODATACOW attribute (as is often suggested, due to the painful
> performance penalty of COW on these files).

NODATACOW implies NODATASUM.

From btrfs(5):
---
Enable data copy-on-write for newly created files.  Nodatacow
implies nodatasum, and disables compression. All files created
under nodatacow are also set the NOCOW file attribute (see
chattr(1)).
---

Although it's talking about the mount option, it also applies to
per-inode options.
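
As a side note for anyone checking whether a given file actually ended up with
the per-inode attribute: the flag is exposed through the standard
FS_IOC_GETFLAGS ioctl (the same interface chattr(1) and lsattr(1) use). A small
illustrative program, with error handling kept to a minimum:
---
#include <fcntl.h>
#include <linux/fs.h>      /* FS_IOC_GETFLAGS, FS_NOCOW_FL */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Inode attribute flags; files created under nodatacow, or marked
     * with chattr +C, have FS_NOCOW_FL set, and on btrfs that also
     * means no data checksums are kept for their contents. */
    int flags = 0;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
        perror("FS_IOC_GETFLAGS");
        close(fd);
        return 1;
    }

    printf("%s: COW %s\n", argv[1],
           (flags & FS_NOCOW_FL) ? "disabled (no data csums)" : "enabled");
    close(fd);
    return 0;
}
---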

Thanks,
Qu

> 
> However, if a device is temporarily dropped (in this case, tested by
> disconnecting drives) and re-connects automatically on the next boot, BTRFS
> does not in any way synchronize the VDI file, or have any means to know
> that one of the copies is out of date and bad.
> 
> The result of trying to use said VDI file is interestingly insane.
> Scrub did not do anything to rectify the situation.
> 
> 





Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-27 Thread Remi Gauvin
There seems to be a major design flaw with BTRFS that needs to be better
documented, to avoid massive data loss.

Tested with Raid 1 on Ubuntu Kernel 4.15

The use case being tested was a VirtualBox VDI file created with the
NODATACOW attribute (as is often suggested, due to the painful
performance penalty of COW on these files).

However, if a device is temporarily dropped (in this case, tested by
disconnecting drives) and re-connects automatically on the next boot, BTRFS
does not in any way synchronize the VDI file, or have any means to know
that one of the copies is out of date and bad.

The result of trying to use said VDI file is interestingly insane.
Scrub did not do anything to rectify the situation.

