On 12/12/2014 08:45 AM, Zygo Blaxell wrote:
On Thu, Dec 11, 2014 at 10:01:06PM -0800, Robert White wrote:
So RAID5 with three media M is

M    MM   MMM
D1   D2   P(a)
D3   P(b) D4
P(c) D5   D6

RAID5 with two media is well defined, and looks like this:

M    MM
D1   P(a)
P(b) D2
D3   P(c)
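
Both diagrams above follow the same rotation rule, which can be sketched in a few lines of Python. This is only an illustration of the diagrams as drawn here: the function name raid5_layout and the "parity starts on the last device and moves left one device per stripe" rule are assumptions read off the pictures, not a description of what btrfs or md actually does on disk.

```python
def raid5_layout(num_devices, num_stripes):
    """Return one row of labels per stripe, e.g. ['D1', 'D2', 'P(a)']."""
    rows = []
    data_index = 1
    for stripe in range(num_stripes):
        # Assumed rotation: parity starts on the last device and moves
        # one device to the left with every stripe, wrapping around.
        parity_col = (num_devices - 1 - stripe) % num_devices
        row = []
        for col in range(num_devices):
            if col == parity_col:
                row.append("P(%s)" % chr(ord("a") + stripe))
            else:
                row.append("D%d" % data_index)
                data_index += 1
        rows.append(row)
    return rows


# Print the three-device and the two-device layouts, three stripes each.
for n in (3, 2):
    for row in raid5_layout(n, 3):
        print("  ".join(label.ljust(4) for label in row).rstrip())
    print()
```

With two devices every stripe holds exactly one data block and one parity block, which is the case the rest of this message is about.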

Like I said in the other fork of this thread... I see (now) that the math works, but I can find no trace of anyone ever having implemented a RAID level greater than one at an arity less than 3 (outside btrfs and its associated materials).

It's like talking about a two-wheeled tricycle. 8-)

I would _genuinely_ like to see any third-party discussion of this. It just isn't done (probably because, as you've shown, it is just a really complicated and CPU-intensive way to end up with a simple mirror). I spent several hours looking. I can see the math works, and I understand what you are doing (as I said at some length in the grandparent message) but it "just isn't done".
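
To make the "complicated way to end up with a mirror" point concrete, here is a minimal sketch assuming plain byte-wise XOR parity. The name xor_parity is made up for the example and the block contents are arbitrary; it is not taken from any btrfs code.

```python
def xor_parity(blocks):
    """XOR equal-length byte blocks into a buffer that starts out zeroed."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


# Two devices: each stripe has one data block, so parity is just a copy.
data = b"btrfs"
two_dev_parity = xor_parity([data])

# Three devices: parity differs from both data blocks, and XORing the
# survivors gives back whichever single block was lost.
d1, d2 = b"AAAA", b"\x01\x02\x03\x04"
p = xor_parity([d1, d2])
recovered_d2 = xor_parity([d1, p])
```

In the two-device case two_dev_parity is byte-for-byte the data block, so the parity device ends up holding the same bytes a RAID1 mirror would, only computed the long way round.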

The reason I use the tricycle example is that, while most people know this instinctively, few are aware of the fact that going from two wheels to three-or-more wheels reverses the steering paradigm. On a bike you push-left, lean-left and go-left. On the higher-arity vehicles (including a bike with a side-car added) you push-right to go left (you lean left too, but that's just to keep from nosing over 8-). I find that quite apt in the whole RAID1 vs RAID5 discussion, since the former is about copying one-or-more times and the latter is about starting with a theoretically zeroed buffer and doing reversible checksumming into it.

I doubt that I will be the last person to be confused by BTRFS' implementation of a two-wheeled tricycle.

You're going to get a lot of mail over the years. 8-)


MEANWHILE

the system really needs to be able to explicitly express and support the "missing" media paradigm.

 M     x    MMM
 D1    .    P(a)
 D3    .    D4
 P(c)  .    D6

The correct logic here to "remove" (e.g. "replace with nothing" instead of "delete") a media just doesn't seem to exist. And it's already painfully missing in the RAID1 situation.

If I have a system with N SATA ports, and I have connected N drives, and device M is starting to fail... I need to be able to disconnect M and then connect M(new), possibly with a non-trivial amount of time in between. For all RAID levels greater than zero this is a natural operation in a degraded mode. And for a nearly full filesystem the shrink operation that is btrfs device delete would not work. And for any nontrivially occupied filesystem it would be way slow, and would then need to be reversed over another way-slow interval.

So I need to be able to "replace" a drive with a "nothing" so that the number of active media becomes N-1 but the arity remains N.
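
As a rough sketch of what "N-1 active media, arity still N" means for a read, assuming simple XOR parity: the stripe keeps all N slots, one of them is marked as missing, and its contents are rebuilt from the survivors. MISSING, xor_blocks and read_stripe are names invented for this example only.

```python
MISSING = None  # hypothetical placeholder for a slot whose device is gone


def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def read_stripe(slots, parity_col):
    """Return the data blocks of one stripe, rebuilding one missing slot."""
    missing = [i for i, block in enumerate(slots) if block is MISSING]
    if len(missing) > 1:
        raise RuntimeError("single parity cannot rebuild two missing slots")
    blocks = list(slots)
    if missing:
        survivors = [block for block in slots if block is not MISSING]
        blocks[missing[0]] = xor_blocks(survivors)
    # Hand back only the data, in device order, without the parity block.
    return [block for i, block in enumerate(blocks) if i != parity_col]


# First stripe of the diagram above: D1 present, middle device missing,
# parity P(a) on the third device.
d1, d2 = b"D1..", b"D2.."
stripe = [d1, MISSING, xor_blocks([d1, d2])]
data = read_stripe(stripe, parity_col=2)
```

The stripe still has three slots after the middle device is pulled, and data comes back as [d1, d2] even though only two devices were actually read.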

mdadm has the "missing" keyword. The Device Mapper has the "zero" target. As near as I can tell, btrfs has got nothing in this functional slot.

Imagine, if you will, a block device that is the anti-/dev/null. All operations on this block device return EFAULT. Let's call it /dev/nothing. And let's say I have a /dev/sdc that has to come out immediately (and all my stuff is RAID1/5/6). The operational chain would be

btrfs replace start /dev/sdc /dev/nothing /
(time passes, physical device is removed and replaced)
btrfs replace start /dev/nothing /dev/sdc /

Now that's good-ish, but really the first replace is pernicious. The internal state for the filesystem should just be able to record that device id 3 (assuming /dev/sda is devid 1, /dev/sdb is devid 2, etc. for this example) is just gone. The replace-with-nothing becomes more-or-less instant.

The first replace is also pernicious if it's the second media failure on a full RAID6 array, since that would be trying to put the same kernel-level device in the array twice.

The restore operation, the replace of the nothing with the something, remains fully elaborate.

The "nothing" devices need to show up in the device id tables for a running array in their geographically correct positions and all that.
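
A minimal sketch of the bookkeeping I have in mind, purely hypothetical: Slot, DeviceTable, mark_missing and restore are invented names for illustration and do not correspond to any existing btrfs structure or command. The point is only that "missing" is a state of a slot, so the devid positions and the arity never change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    devid: int
    path: Optional[str]  # None means this slot is currently "missing"


class DeviceTable:
    def __init__(self, paths):
        # devids are assigned by position and never reused or renumbered.
        self.slots = [Slot(devid=i + 1, path=p) for i, p in enumerate(paths)]

    @property
    def arity(self):
        return len(self.slots)

    @property
    def active(self):
        return sum(1 for s in self.slots if s.path is not None)

    def _slot(self, devid):
        for s in self.slots:
            if s.devid == devid:
                return s
        raise KeyError(devid)

    def mark_missing(self, devid):
        """Record that a device is gone. No data moves, so this is instant."""
        self._slot(devid).path = None

    def restore(self, devid, path):
        """Attach a device to a missing slot; the caller must then rebuild it."""
        slot = self._slot(devid)
        if slot.path is not None:
            raise ValueError("devid %d is not missing" % devid)
        if any(s.path == path for s in self.slots):
            raise ValueError("%s is already in the array" % path)
        slot.path = path

    def describe(self):
        return ["devid %d: %s" % (s.devid, s.path or "missing")
                for s in self.slots]


table = DeviceTable(["/dev/sda", "/dev/sdb", "/dev/sdc"])
table.mark_missing(3)
report = table.describe()
```

Because a missing slot is a state rather than a stand-in device, two slots can be missing at once on a RAID6-style array without the same kernel device ever appearing in the table twice.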

Without this "missing" status as a first-class part of the system, dealing with failures and communicating about those failures with the operator will become vexatious.


[The use of "device delete" and "device add" as changes in arity and size, and their inapplicability to cases where a failure is being dealt with absent a change of arity, could be clearer in the documentation.]
