Ellis H. Wilson III posted on Wed, 14 Feb 2018 11:00:29 -0500 as excerpted:
> Hi again -- back with a few more questions:
> Frame-of-reference here: RAID0. Around 70TB raw capacity. No
> compression. No quotas enabled. Many (potentially tens to hundreds) of
> subvolumes, each with tens of snapshots. No control over size or number
> of files, but directory tree (entries per dir and general tree depth)
> can be controlled in case that's helpful.
How can you control both breadth (entries per dir) AND depth of the
directory tree without ultimately limiting your number of files?
Or do you mean you can control breadth XOR depth of tree as needed,
allowing the other to expand as necessary to accommodate the uncontrolled
number of files?
Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
limit on per-directory hard links before additional ones are out-of-lined
into a secondary node, with the attendant performance implications.
> 1. I've been reading up about the space cache, and it appears there is a
> v2 of it called the free space tree that is much friendlier to large
> filesystems such as the one I am designing for. It is listed as OK/OK
> on the wiki status page, but there is a note that btrfs progs treats it
> as read only (i.e., btrfs check repair cannot help me without a full
> space cache rebuild is my biggest concern) and the last status update on
> this I can find was circa fall 2016. Can anybody give me an updated
> status on this feature? From what I read, v1 and tens of TB filesystems
> will not play well together, so I'm inclined to dig into this.
At tens of TB, yes, the free-space-cache (v1) has issues that the free-
space-tree (aka free-space-cache-v2) is designed to solve. And v2
should be very well tested in large enterprise installations by now,
given facebook's usage and intimate involvement with btrfs.
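For anyone wanting to try it, switching an existing filesystem to the
free-space-tree is a mount-option toggle. A sketch (/dev/sdX and /mnt are
placeholders; adjust to your setup):

```shell
# Mounting once with space_cache=v2 (kernel 4.5+) builds the free
# space tree; the setting then persists across subsequent mounts.
mount -o space_cache=v2 /dev/sdX /mnt

# Confirm it took effect: the mount line should show space_cache=v2.
grep ' /mnt ' /proc/mounts

# To revert to v1, the tree must first be removed while unmounted:
umount /mnt
btrfs check --clear-space-cache v2 /dev/sdX
```

Note the asymmetry: enabling is a mount option, but clearing requires
btrfs-progs, which ties back to the read-only-support concern above.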
But I have an arguably more basic concern... Pardon me for reviewing the
basics, as I feel rather like a pupil attempting to lecture a teacher on
the point and you could very likely teach /me/ about them, but the setup
you describe, raid0, particularly at the 10s-of-TB scale, has some
implications that don't match your specified concerns above particularly
well.
Of course "raid0" is a convenient misnomer, as there's nothing
"redundant" about the "array of independent devices" in a raid0
configuration, it's simply done for the space and speed features, with
the sacrificial tradeoff being reliability. It's only called raid0 as a
convenience, allowing it to be grouped with the other raid configurations
where "redundant" /is/ a feature, with the more important grouping
commonality being they're all multi-device.
Because reliability /is/ the sacrificial tradeoff for raid0, it's
relatively safe to assume that either reliability isn't needed at all
because the data is literally of "throw-away" value (cache, say, where
refilling the cache isn't a big cost or time factor), or reliability is
assured by other mechanisms, backups being the most basic, but there are
others like multi-layered raid, etc. In practice that makes at least the
particular instance of the data on the raid0 of "throw-away" value, even
if the data as a whole is not.
So far, so good. But then above you mention concern about btrfs-progs
treating the free-space-tree (free-space-cache-v2) as read-only, and the
time cost of having to clear and rebuild it after a btrfs check --repair.
Which is what triggered the mismatch warning I mentioned above. Either
the raid0 data is of throw-away value appropriate to placement on a
raid0, in which case btrfs check --repair is of little concern since its
benefits are questionable anyway (no guarantees it'll work, and the data
is either directly of throw-away value, or there's a backup at hand that
/does/ have a tested guarantee of viability, or it's not worthy of being
called a backup in the first place), or it's not.
It's that concern about the viability of btrfs check --repair on what
you're defining as throw-away data by placing it on raid0 in the first
place, that's raising all those red warning flags for me! And the fact
that you didn't even bother to explain it with a side note to the effect
that the reliability is addressed some other way, but you still need to
worry about btrfs check --repair viability because $REASONS, is turning
those red flags into flashing red lights accompanied by blaring sirens!
OK, so let's assume you /do/ have a tested backup, ready to go. Then the
viability of btrfs check --repair is of less concern, but remains
something you might still be interested in for trivial cases, because
let's face it, transferring tens of TB of data, even if ready at hand,
does take time, and if you can avoid it because the btrfs check --repair
fix is trivial, it's worth doing so.
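For those trivial cases, the usual sequence is a read-only check first,
with --repair strictly a last resort. A sketch (device and mountpoint
names are placeholders):

```shell
# btrfs check must run on an unmounted filesystem.
umount /mnt

# Read-only by default: reports problems without touching the disk.
btrfs check /dev/sdX

# Only if the reported damage looks trivial, and ideally only after
# imaging the device, since --repair can make things worse:
btrfs check --repair /dev/sdX
```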
A valid case, but there's nothing in your post indicating it's valid as
yours.
Of course the other possibility is live-failover, which is sure to be
facebook's use-case. But with live-failover, the viability of btrfs
check --repair more or less ceases to be of interest, because the failover
happens (relative to the offline check or restore time) instantly, and
once the failed devices/machine is taken out of service it's far more
effective to simply blow away the filesystem (if not replacing the
device(s) entirely) and restore "at leisure" from backup, a relatively
guaranteed procedure compared to the "no guarantees" of attempting to
check --repair the filesystem out of trouble.
Which is very likely why the free-space-tree still isn't well supported
by btrfs-progs, including btrfs check, several kernel (and thus -progs)
development cycles later. The people who really need the one (whichever
one of the two)... don't tend to (or at least /shouldn't/) make use of
the other so much.
It's also worth mentioning that btrfs raid0 mode, like single mode,
hobbles the btrfs data and metadata integrity feature. Checksums are
still generated, stored and checked by default, so integrity problems can
still be detected, but because raid0 (and single) includes no redundancy,
there's no second copy (raid1/10) or parity redundancy (raid5/6) to
rebuild the bad data from, so it's simply gone.
(Well, for data you can try btrfs restore of the otherwise inaccessible
file and hope for the best, and for metadata, you can try check --repair
and again hope for the best, but...) If you're using that feature of
btrfs and want/need more than just detection of a problem that can't be
fixed due to lack of redundancy, there's a good chance you want a real
redundancy raid mode on multi-device, or dup mode on single device.
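To illustrate the difference: a scrub on raid0 can only report the
damage, while dup metadata on a single device gives btrfs something to
repair from. A sketch (device/mountpoint names are placeholders):

```shell
# On raid0/single, scrub detects checksum errors but cannot fix them;
# -B stays in the foreground, -d prints per-device statistics.
btrfs scrub start -Bd /mnt

# Single-device alternative keeping metadata self-healing (new fs):
mkfs.btrfs -d single -m dup /dev/sdY

# Or convert an existing filesystem's metadata to dup in-place:
btrfs balance start -mconvert=dup /mnt
```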
So bottom line... given the sacrificial lack of redundancy and
reliability of raid0, btrfs or not, in an enterprise setting with tens of
TB of data, why are you worrying about the viability of btrfs check
--repair on what placement on raid0 decrees to be throw-away data anyway?
At first glance, one of the two must be wrong: either the raid0 mode, and
thus the declared throw-away value of tens of TB of data, or the concern
about the viability of btrfs check --repair, which indicates you don't
actually consider that data to be of throw-away value after all. Which
one is wrong is your call, and there are certainly individual cases (one
of which I even named) where concern about the viability of btrfs check
--repair on raid0 might be valid, but your post has no real indication
that your case is such a case, and honestly, that worries me!
> 2. There's another thread on-going about mount delays. I've been
> completely blind to this specific problem until it caught my eye. Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount? Yes, I know it will depend on the
> dataset. I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
No input on that question here (my own use-case couldn't be more
different, multiple small sub-half-TB independent btrfs raid1s on
partitioned ssds), but another concern, based on real-world reports I've
seen here on the list:
12-14 TB individual drives?
While you /did/ say enterprise grade so this probably doesn't apply to
you, it might apply to others that will read this.
Be careful that you're not trying to use the "archive application"
targeted SMR drives for general purpose use. Occasionally people will
try to buy and use such drives in general purpose use due to their
cheaper per-TB cost, and it just doesn't go well. We've had a number of
reports of that. =:^(
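A couple of quick checks can catch host-aware/host-managed SMR drives
before deployment (sdX is a placeholder; note that drive-managed SMR
models hide their zones and still report "none", so the vendor datasheet
remains the final word):

```shell
# "none" = conventional; "host-aware"/"host-managed" = SMR
cat /sys/block/sdX/queue/zoned

# Same information for all drives at once (util-linux 2.30+):
lsblk -o NAME,SIZE,ZONED
```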
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html