Chris Murphy posted on Tue, 26 Apr 2016 18:58:06 -0600 as excerpted:

> On Tue, Apr 26, 2016 at 5:44 AM, Juan Alberto Cirez
> <jaci...@rdcsafety.com> wrote:
>> RAID10 configuration, on the other hand, requires a minimum of four
>> HDD, but it stripes data across mirrored pairs. As long as one disk
>> in each mirrored pair is functional, data can be retrieved.
>
> Not Btrfs raid10. It's not the devices that are mirrored pairs, but
> rather the chunks. There's no way to control or determine on what
> devices the pairs are. It's certain you get at least a partial
> failure (data for sure and likely metadata if it's also using raid10
> profile) of the volume if you lose more than 1 device; planning-wise
> you have to assume you lose the entire array.

Primarily quoting and restating the above (and below) to emphasize it.

Remember:

* Btrfs raid is chunk-level, *NOT* device-level.  That has important
implications for recovery from a degraded state.

* Btrfs parity-raid (raid56 mode) isn't yet mature, and is definitely
nothing I'd trust in production.

* Btrfs redundancy-raid (raid1 and raid10 modes, as well as dup mode on
a single device) is precisely pair-copy -- two copies, with the raid
modes forcing each copy onto a different device or set of devices.
More devices simply means more space, *NOT* more redundancy/copies.

Again, these copies are at the chunk level.  Chunks can and will be
distributed across devices based on most space available, so loss of
more than one device will in most cases kill the array.  Because
mirror pairs happen at the chunk level, not the device level, there's
no "we only lost one side of the mirror pair" scenario that lets more
than a single device fail: statistically, the chance that both copies
of at least some chunks sat on the two now failed/missing devices is
pretty high.  (There's a toy simulation of that allocation further
down, for a feel of just how certain it is.)

* Btrfs raid10 stripes N/2-way, while only duplicating exactly two-way.
So a six-device raid10 stripes three devices per mirror, while a
five-device raid10 stripes two devices per mirror, with the odd device
out landing somewhere different for each new chunk, again due to the
most-space-left allocation algorithm.

>> With GlusterFS as a distributed volume, the files are already spread
>> among the servers causing file I/O to be spread fairly evenly among
>> them as well, thus probably providing the benefit one might expect
>> with stripe (RAID10).
>
> Yes, the raid1 of Btrfs is just so you don't have to rebuild volumes
> if you lose a drive. But since raid1 is not n-way copies, and only
> means two copies, you don't really want the file systems getting that
> big or you increase the chances of a double failure.

Again emphasizing.  Since you're running a distributed filesystem on
top, keep the lower-level btrfs raids small and run more of them --
multiple btrfs raid bricks per machine, even -- as long as your
distributed level is specced to survive losing the bricks of at least
one entire machine, of course.

OTOH, unlike traditional raid, btrfs does actual checksumming and
data/metadata integrity verification at the block level, and can and
will detect integrity issues and correct them from the second copy
when the raid level supplies one, assuming that copy is good, of
course.  That should fix problems at the lower level that other
filesystems wouldn't, meaning fewer problems ever reach the
distributed level in the first place.  (The second sketch below boils
that verify-and-repair idea down to a few lines.)

Thus, also emphasizing something Austin suggested: you may wish to
consider btrfs raid1 on top of a pair of mdraid or dmraid raid0s.
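Since the chunk-level point above is the one people most often
misread, here's a toy model of it in Python.  This is purely
illustrative and is *not* the real btrfs allocator: it assumes fixed
1 GiB chunks, ignores metadata chunks and raid10 striping entirely,
and uses a random tie-break to stand in for real-world free-space
variation.  It just applies the "two copies, each on whichever devices
currently have the most free space" rule and then checks every
possible two-device failure.

#!/usr/bin/env python3
# Toy model of btrfs-style two-copy chunk allocation (raid1/raid10
# placement only).  NOT the real allocator -- just the "two copies on
# the two devices with the most free space" rule, used to show why
# losing any two devices of a many-device btrfs raid1/raid10 almost
# certainly loses both copies of some chunks.
import itertools
import random

def allocate_chunks(dev_sizes_gib, chunk_gib=1):
    """Return one (dev_a, dev_b) copy-pair per allocated chunk."""
    free = dict(enumerate(dev_sizes_gib))
    chunks = []
    while True:
        # The two devices with the most free space get the two copies;
        # the random component only breaks ties between equal devices.
        candidates = sorted(free, key=lambda d: (free[d], random.random()),
                            reverse=True)[:2]
        if len(candidates) < 2 or any(free[d] < chunk_gib for d in candidates):
            break
        a, b = candidates
        free[a] -= chunk_gib
        free[b] -= chunk_gib
        chunks.append((a, b))
    return chunks

def survives(chunks, failed):
    """True only if every chunk still has at least one copy available."""
    return all(not (a in failed and b in failed) for a, b in chunks)

if __name__ == "__main__":
    chunks = allocate_chunks([100] * 6)   # hypothetical 6 x 100 GiB array
    print(f"allocated {len(chunks)} chunk copy-pairs")
    for failed in itertools.combinations(range(6), 2):
        status = "ok" if survives(chunks, set(failed)) else "chunks lost"
        print(f"fail devices {failed}: {status}")

On that hypothetical six-device array, every one of the fifteen
two-device failure combinations ends up losing both copies of some
chunks, which is exactly why you plan as if losing a second device
means losing the volume.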
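And the verify-and-repair behavior mentioned above, boiled down to a
minimal sketch.  The two Python dicts below are just stand-ins for the
two mirror copies of a block; real btrfs keeps crc32c checksums in a
dedicated tree and does this transparently on reads and during scrub,
but the principle is the same: only a copy that matches its checksum
is returned, and a bad copy gets rewritten from the good one.

import zlib

def store(mirror_a, mirror_b, key, data):
    """Write a block: one checksummed copy per mirror."""
    record = (zlib.crc32(data), data)
    mirror_a[key] = record
    mirror_b[key] = record

def read_with_repair(mirror_a, mirror_b, key):
    """Return verified data; if one copy is corrupt, heal it from the other."""
    for primary, other in ((mirror_a, mirror_b), (mirror_b, mirror_a)):
        csum, data = primary[key]
        if zlib.crc32(data) == csum:
            # Good copy found; repair the peer if its checksum doesn't verify.
            o_csum, o_data = other[key]
            if zlib.crc32(o_data) != o_csum:
                other[key] = (csum, data)
            return data
    raise IOError("both copies failed checksum verification")

if __name__ == "__main__":
    a, b = {}, {}
    store(a, b, "block0", b"important data")
    a["block0"] = (a["block0"][0], b"c0rrupted data")  # simulate silent corruption
    print(read_with_repair(a, b, "block0"))            # returns the good copy...
    assert a["block0"][1] == b"important data"         # ...and heals mirror a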
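Finally, Austin's suggestion as a rough sketch of the layering, before
I get into why it would normally be frowned upon.  Everything here is
a placeholder for illustration -- the device names, md array names and
mountpoint are made up, you'd tune the details for real hardware, and
running something like this wipes the named devices -- but the shape
is: two mdraid raid0 legs, one btrfs raid1 across the pair, mounted as
a single brick.

#!/usr/bin/env python3
# Sketch only: two mdraid raid0 legs, btrfs raid1 (data and metadata)
# across the pair, mounted as one brick for the distributed layer.
# All device names are placeholders; this DESTROYS data on them.
import os
import subprocess

RAID0_A = ["/dev/sda", "/dev/sdb"]   # placeholder members of the first leg
RAID0_B = ["/dev/sdc", "/dev/sdd"]   # placeholder members of the second leg
MOUNTPOINT = "/mnt/brick0"           # placeholder brick mountpoint

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Build the two raid0 legs with mdadm.
run(["mdadm", "--create", "/dev/md0", "--level=0",
     f"--raid-devices={len(RAID0_A)}"] + RAID0_A)
run(["mdadm", "--create", "/dev/md1", "--level=0",
     f"--raid-devices={len(RAID0_B)}"] + RAID0_B)

# 2. btrfs raid1 for both data and metadata across the two legs, so
#    each chunk gets exactly one copy per leg.
run(["mkfs.btrfs", "-d", "raid1", "-m", "raid1", "/dev/md0", "/dev/md1"])

# 3. Mount the result as one brick for the distributed filesystem above.
os.makedirs(MOUNTPOINT, exist_ok=True)
run(["mount", "/dev/md0", MOUNTPOINT])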
As you are likely well aware, raid1 on top of raid0 is normally called
raid01, and is discouraged in favor of raid10 (raid0 on top of raid1)
for rebuild-efficiency reasons: with raid1 underneath, rebuilding a
lost device is localized to the presumably two-device raid1, while
with raid1 on top, the whole raid0 stripe must be rebuilt, normally at
the whole-device level.

Putting the btrfs raid1 on top reverses this and would *normally* be
discouraged as raid01, but btrfs raid1's operational data-integrity
handling changes the picture: while losing a device still means
rebuilding that whole raid0 leg from the other one, an individual bad
block gets repaired from the other copy on its own -- no whole-device
failure or rebuild necessary.  And you can't get that by putting btrfs
raid0 on top instead, since the raid1 layer underneath won't be doing
any integrity verification.  If the bad block happens to be the copy
the underlying raid1 returns, the btrfs raid0 will simply fail its
checksum verification and error out that read, despite a good copy
sitting on the underlying raid1, because btrfs knows nothing about it.

Meanwhile, as Austin says, btrfs' A/B copy read scheduling is...
unoptimized.  Basically, it's simple even/odd PID based, so a single
read thread will always hit the same copy, leaving the other one idle.

I've argued before that precisely that is a very good indication of
where the btrfs devs themselves think btrfs is at.  It's clearly
suboptimal, and much better scheduling examples, including the mdraid
read-scheduling code, praised for its efficiency, already exist in the
kernel.  The failure to optimize must then come down to either a
simple lack of time, given higher-priority development and bugfixing
tasks, or deliberate avoidance of the dangers of "premature
optimization".  In either case, that such unoptimized code remains in
such a highly visible and performance-critical place is an extremely
strong indicator that the btrfs devs themselves don't consider btrfs a
stable and mature filesystem yet.

And putting a pair of md/dm raid0s below that btrfs raid1 both helps
make up a bit for the braindead btrfs raid1 read scheduling, and lets
you exploit btrfs raid1's data-integrity features.  It also forces
btrfs into a more deterministic distribution of those chunk copies --
one copy per raid0 leg -- so you can in principle lose all the devices
in one of the raid0s as long as the other one remains functional.
That's nothing to really count on, though, so you still plan for
single-device-failure redundancy only at the individual brick level,
and use the distributed filesystem layer to deal with whole-brick
failure above that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman