Juan Alberto Cirez posted on Wed, 27 Apr 2016 14:18:27 -0600 as excerpted:

> Quick question: Suppose I have n-number of storage pods (physical
> servers with n-number of physical hdds). The end deployment will be
> btrfs at the brick/block level with a distributed filesystem on top,
> keeping in mind that my overriding goal is to have high availability
> and a mechanism whereby the loss of a drive, or multiple drives in a
> single pod, will not jeopardize data.
>
> >>>>>Question<<<<<
>
> Does partitioning the physical drives and creating a btrfs filesystem
> on each partition, then configuring each partition as an individual
> brick/block, offer ANY added benefit over grouping the entire pod into
> a drive pool and using that pool as a single block/brick to expose to
> the distributed filesystem?
I would say... you're overlooking a level between the two options you 
suggested, one I'd consider to have some benefits under certain 
configurations.

You described a deployment of M individual physical server pods with N 
physical hdds each, and then wondered about partitioning the individual 
hdds. With a suitably high N (say double digits, certainly more than 
N=3, N again being the number of physical hdds per physical server), I 
see no value in partitioning individual hdds, but there may well be 
reason to split those N hdds into multiple individual btrfs on the same 
physical server, as opposed to grouping them all into one larger single 
btrfs.

The problem is that btrfs raid1 and raid10 modes only do two-way 
mirroring, and that's two-way mirroring at the individual chunk level. 
Unlike N-way-mirroring conventional raid1, or device-level raid10 where 
a second failure only hurts if it hits the same mirror pair, losing more 
than one device from a many-device btrfs raid1/raid10 is very likely to 
take out the whole filesystem, because some chunk is almost certainly 
mirrored on exactly the two devices that died.

Thus it can make sense to configure a higher number of, say, 2-3 device 
btrfs raid1 or 4-6 device btrfs raid10 filesystems (or 4-9 hdds total in 
a hybrid, btrfs raid1 on top of mdraid0, 2-3 hdds per mdraid0 device, 
2-3 mdraid0 devices per btrfs raid1), as opposed to a single 
double-digit-device btrfs raid1 or raid10. A 10+ device btrfs raid1 or 
raid10 pool carries a high risk of a second device going out on that one 
pool, compared to multiple lower-device-count raid1s or raid10s, where 
the chance of multiple devices going out on the same individual btrfs is 
much smaller. But at that level I'd still see no reason to actually 
partition individual hdds. (Tho if you're limiting N to 2-4 hdds, 
there's a possible narrow case for it.)

Now what this effectively does is add another level to your distributed 
stack. Instead of having only the distributed level and the 
machine/brick level, with the distributed level composed of multiple 
machine-bricks and each machine-brick being a single filesystem pool, 
you now have the distributed level, the machine level, and the 
individual filesystem level, with multiple but smaller pools on each 
machine.

That does change your global/distributed-level strategy to some degree, 
because loss of an individual machine can now take out multiple 
filesystem pools. Basically, you need to configure your global level to 
tolerate loss of machine-multiples of the individual filesystems. So if 
you're doing, say, three filesystems per physical machine, you will want 
to configure tolerance for at least three, and better six, of the 
individual filesystems, which handles loss of one and two machines, 
respectively.

If you're looking at a massive enough number of hdds, say 20+ per 
machine, I'd configure them as in the parenthetical above: btrfs raid1 
(or raid10) over mdraid0 (or dmraid0 if you prefer), 2-3 physical hdds 
per mdraid0, 2-3 mdraid0s per btrfs raid1 or 4-6 mdraid0s per btrfs 
raid10, and 2-3 btrfs per machine.

At the low end that's two hdds per mdraid0, two mdraid0s per btrfs 
raid1, two btrfs raid1 per machine, 2*2*2=8 hdds per machine, tho at 
only 8 I'd probably go 9 and do 3 hdds per mdraid0, 3 mdraid0s per 
btrfs, a single btrfs per machine. At the top end, that's 3 hdds per 
mdraid0, 6 mdraid0s per btrfs raid10, 3 btrfs per machine, 3*6*3=54 hdds 
per physical machine. If you then configure the global level to tolerate 
the loss of 6 btrfs, that is 2 full machines, then with a 20-machine 
network that's 54*20=1080 hdds, with a minimum loss tolerance of 10%: 
108 hdds, 6 btrfs, or two full machines.
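To make that arithmetic concrete, here's a rough back-of-the-envelope 
sketch in Python of the top-end layout. The variable names are mine, 
purely for illustration, not anything btrfs or mdraid themselves use:

    # Top-end layout sketched above: btrfs raid10 over mdraid0 devices.
    hdds_per_mdraid0   = 3   # physical hdds striped into each mdraid0 device
    mdraid0s_per_btrfs = 6   # mdraid0 devices pooled into each btrfs raid10
    btrfs_per_machine  = 3   # independent btrfs filesystems (bricks) per machine
    machines           = 20  # physical machines in the distributed cluster

    hdds_per_machine = hdds_per_mdraid0 * mdraid0s_per_btrfs * btrfs_per_machine
    total_hdds       = hdds_per_machine * machines

    # Surviving the loss of two whole machines means the distributed layer
    # must tolerate losing 2 * btrfs_per_machine individual bricks.
    min_brick_tolerance = 2 * btrfs_per_machine

    print(hdds_per_machine)      # 3*6*3 = 54 hdds per machine
    print(total_hdds)            # 54*20 = 1080 hdds total
    print(min_brick_tolerance)   # 6 bricks = two full machines
    print(2 * hdds_per_machine)  # 108 hdds lost, 10% of the 1080 total

Plug in the low-end numbers (2 hdds per mdraid0, 2 mdraid0s per btrfs 
raid1, 2 btrfs per machine) and you get the 8-hdds-per-machine case 
instead.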
Sounds like the sort of storage a supercollider might require. We had a 
guy in here who works for one, for a few months; don't know if he's 
still around.

It is worth noting in all this, however, and I've not seen it brought up 
directly to you yet, that btrfs isn't particularly stable and mature 
yet, tho it is stabilizing (with emphasis on the ING). With enough 
redundancy and care you may be fine, but the usual rule, if it's not 
backed up you didn't value it enough to be worth the backup, certainly 
applies, even more to btrfs than to fully mature and stable filesystems. 
And when you're looking at that sort of massive amount of data, backups 
have a whole new set of problems, the biggest being simply moving that 
much data around in a timely enough manner that the backup isn't history 
before it's completed. Redundancy can help here, but it's still an 
issue, and the fact of the matter is, btrfs may simply not be at a 
suitable stability point for your usage requirements.

That's actually what the supercollider guy concluded for his 
requirements, for the time being; btrfs may get there in some years, but 
his data rate requirements were simply too high to allow for timely 
backups, and he concluded the risk of running without them, given btrfs' 
present stability level, was simply too high for his needs.

zfs is a more mature filesystem with similar features, tho it has higher 
hardware requirements; in particular, massive amounts of ecc ram are 
strongly recommended on linux, tho less so on solaris, and I'm not sure 
about the bsds. There are also licensing issues with it on linux that 
may or may not be a problem, depending on how strict your work and legal 
environment is. Alternatively, xfs is considered stable and mature on 
linux and is often used for huge storage needs, tho without the features 
of btrfs or zfs, and of course there's the standard ext4, again without 
the features.

Personally, if I were working on a large project that needed the 
features and could handle the legal situation, but didn't consider btrfs 
suitably stable and mature, I'd go zfs. If I didn't need the features, 
probably xfs, of course on top of mdraid, or possibly a hardware raid 
and mdraid hybrid, to get the multi-hdd coverage.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman