On 2015-01-02 12:45, Brendan Hide wrote:
On 2015/01/02 15:42, Austin S Hemmelgarn wrote:
On 2014-12-31 12:27, ashf...@whisperpc.com wrote:
I see this as a CRITICAL design flaw. The reason for calling it CRITICAL is that System Administrators have been trained for >20 years that RAID-10 can usually handle a dual-disk failure, but the BTRFS implementation has effectively ZERO chance of doing so.
No, some rather simple math
That's the problem. The math isn't as simple as you'd expect:

The example below is probably a pathological case - but here goes. Let's say in this 4-disk example that each chunk is striped as d1,d2,d1,d2, where d1 is the first piece of data and d2 is the second - so each piece is stored on two of the four disks:
Chunk 1 might be striped across disks A,B,C,D d1,d2,d1,d2
Chunk 2 might be striped across disks B,C,A,D d3,d4,d3,d4
Chunk 3 might be striped across disks D,A,C,B d5,d6,d5,d6
Chunk 4 might be striped across disks A,C,B,D d7,d8,d7,d8
Chunk 5 might be striped across disks A,C,D,B d9,d10,d9,d10

Lose any two disks and, for *each* chunk, there is a 1-in-3 chance that both copies of one of its pieces sat on the failed pair, i.e. that the chunk is lost. With traditional RAID10 on four disks you lose the array entirely only when both disks of one mirror pair fail (2 of the 6 possible pairs, about 33%). With btrfs, the more data you have stored, the closer the chance of losing *some* data in a 2-disk failure gets to 100%.

In the above example, losing A and B means you lose d3, d6, and d7
(which ends up being 60% of all chunks).
Losing A and C means you lose d1 (20% of all chunks).
Losing A and D means you lose d9 (20% of all chunks).
Losing B and C means you lose d10 (20% of all chunks).
Losing B and D means you lose d2 (20% of all chunks).
Losing C and D means you lose d4, d5, AND d8 (60% of all chunks).

The above skewed example has an average of a third of all chunks failed (10 chunks lost across the 6 possible failure pairs x 5 chunks). As you add more data and randomise the allocation, that average stays around 1 in 3 - BUT, the chance of losing *some* data is already clearly shown to be very close to 100%.
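
A quick sketch of the arithmetic, for anyone who wants to play with it (illustrative only - the layouts and the random-allocation model here are a simplification, not how btrfs actually picks devices). It re-runs the 5-chunk enumeration above, then simulates many randomly-allocated chunks, showing the average fraction of chunks lost in a 2-disk failure sits around a third while the chance of losing *some* data goes to essentially 100%:

import itertools
import random

DISKS = "ABCD"

# The five example chunks above: each chunk maps its two data pieces onto
# the four disks in the order given, so piece 1 lives on disks[0] and
# disks[2], piece 2 on disks[1] and disks[3].
example_chunks = ["ABCD", "BCAD", "DACB", "ACBD", "ACDB"]

def chunk_lost(disks, failed):
    """A chunk loses data if both copies of either piece are on failed disks."""
    piece1 = {disks[0], disks[2]}
    piece2 = {disks[1], disks[3]}
    return piece1 <= failed or piece2 <= failed

# Reproduce the enumeration from the example.
for failed in itertools.combinations(DISKS, 2):
    lost = sum(chunk_lost(c, set(failed)) for c in example_chunks)
    print(f"lose {failed[0]} and {failed[1]}: {lost}/5 chunks lost")

# The general claim: with many chunks allocated in a random order, any
# 2-disk failure almost certainly loses *some* data, even though the
# average fraction of chunks lost stays around 1/3.
random.seed(0)
trials, nchunks = 1000, 200
some_loss = 0
frac_lost = 0.0
for _ in range(trials):
    chunks = ["".join(random.sample(DISKS, 4)) for _ in range(nchunks)]
    failed = set(random.sample(DISKS, 2))
    lost = sum(chunk_lost(c, failed) for c in chunks)
    some_loss += lost > 0
    frac_lost += lost / nchunks
print(f"P(some data lost) ~ {some_loss / trials:.3f}")
print(f"average fraction of chunks lost ~ {frac_lost / trials:.3f}")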

OK, I forgot about the randomization effect that chunk allocation and freeing has. We really should slap a *BIG* warning label on that (and ideally find a better way to do it so it's more reliable).

As an aside, I've found that a BTRFS raid1 set on top of 2 LVM/MD RAID0 sets is actually faster than a BTRFS raid10 set with the same number of disks (how much faster is workload dependent), and it provides better guarantees.
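
To make the "better guarantees" part concrete, here's a back-of-the-envelope enumeration, assuming the four disks split into two RAID0 legs as A+B and C+D (the labels are mine, purely illustrative): the layered setup survives any 2-disk failure confined to a single leg, while btrfs raid10 with enough chunks loses some data on essentially every 2-disk failure (see the simulation above).

import itertools

DISKS = "ABCD"
# Hypothetical split: disks A+B form one LVM/MD RAID0 leg, C+D the other,
# and btrfs raid1 keeps one full copy of everything on each leg.
LEGS = [{"A", "B"}, {"C", "D"}]

def layered_survives(failed):
    # A RAID0 leg dies if it loses any member disk; btrfs raid1 survives
    # as long as at least one leg is completely intact.
    return any(leg.isdisjoint(failed) for leg in LEGS)

for failed in itertools.combinations(DISKS, 2):
    status = "survives" if layered_survives(set(failed)) else "data lost"
    print(f"lose {failed[0]} and {failed[1]}: raid1-over-raid0 {status}")

# 2 of the 6 possible 2-disk failures (A+B, or C+D) are survivable here.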
