On 02/15/2018 01:14 AM, Chris Murphy wrote:
> On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III <ell...@panasas.com> wrote:
>
>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No compression.
>> No quotas enabled.  Many (potentially tens to hundreds) of subvolumes, each
>> with tens of snapshots.
>
> Even if non-catastrophic to lose such a file system, it's big enough
> to be tedious and take time to set it up again. I think it's worth
> considering one of two things as alternatives:
>
> a. metadata raid1, data single: you lose the striping performance of
> raid0, and if it's not randomly filled you'll end up with some disk
> contention for reads and writes *but* if you lose a drive you will not
> lose the file system. Any missing files on the dead drive will result
> in EIO (and I think also a kernel message with path to file), and so
> you could just run a script to delete those files and replace them
> with backup copies.
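
For concreteness, the scan-and-restore pass you describe could be about as
simple as the sketch below. The mount point and the restore hook are
placeholders for whatever the surrounding system provides; the only
btrfs-specific assumption is that reads of files whose data lived on the
dead drive fail with EIO:

#!/usr/bin/env python3
# Sketch: walk the mount, find files whose data returns EIO on read, and
# hand them to a (placeholder) delete-and-restore step.
import errno
import os

MOUNT = "/mnt/btrfs"       # hypothetical mount point
CHUNK = 1024 * 1024        # read in 1 MiB chunks

def is_unreadable(path):
    """True if reading the file raises EIO."""
    try:
        with open(path, "rb") as f:
            while f.read(CHUNK):
                pass
        return False
    except OSError as e:
        return e.errno == errno.EIO

damaged = []
for root, _dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(root, name)
        if is_unreadable(path):
            damaged.append(path)

for path in damaged:
    print("EIO:", path)
    # os.remove(path)              # delete the bad copy...
    # restore_from_backup(path)    # ...and pull it back in (placeholder hook)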

This option is on our roadmap for future releases of our parallel file
system, but unfortunately we do not presently have the time to implement
the functionality for the manager of that btrfs filesystem to report up to
the pfs manager that said files have gone missing. We will absolutely be
revisiting that as an option in early 2019, as replacing just one disk
instead of N is highly attractive. Waiting for EIO as you suggest in b is
a non-starter for us, as we're working at scales sufficiently large that
we don't want to wait for someone to stumble over a partially degraded
file. Proactive reporting is what's needed, and we'll implement that Real
Soon Now.
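
For what it's worth, the detection half of that looks straightforward with
stock tooling: scrub periodically and look at the per-device error
counters, roughly as sketched below (mount point hypothetical). The piece
we have not built is mapping those errors back to the individual files so
the pfs manager can be told exactly what needs replacing.

#!/usr/bin/env python3
# Sketch: run a foreground scrub, then parse 'btrfs device stats' for
# non-zero error counters and report anything found.
import subprocess

MOUNT = "/mnt/btrfs"   # hypothetical mount point

# -B keeps the scrub in the foreground so the counters are only read
# after it has finished.
subprocess.run(["btrfs", "scrub", "start", "-B", MOUNT], check=False)

# Output lines look like "[/dev/sdb].read_io_errs   0".
out = subprocess.run(["btrfs", "device", "stats", MOUNT],
                     capture_output=True, text=True, check=True).stdout

errors = {}
for line in out.splitlines():
    parts = line.split()
    if len(parts) == 2 and parts[1].isdigit() and int(parts[1]) > 0:
        errors[parts[0]] = int(parts[1])

if errors:
    # notify_pfs_manager(errors)   # placeholder for the report up to the pfs manager
    for counter, count in sorted(errors.items()):
        print(counter, "=", count)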

> b. Variation on the above would be to put it behind glusterfs
> replicated volume. Gluster getting EIO from a brick should cause it to
> get a copy from another brick and then fix up the bad one
> automatically. Or in your raid0 case, the whole volume is lost, and
> glusterfs helps do the full rebuild over 3-7 days while you're still
> able to access those 70TB of data normally. Of course, this option
> requires having two 70TB storage bricks available.
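
For completeness, the two-brick replicated setup you describe boils down
to roughly the following (hostnames, brick paths, and the volume name are
made up; each brick would sit on one of the 70TB btrfs file systems):

#!/usr/bin/env python3
# Sketch: create and start a two-way replicated gluster volume on top of
# two btrfs bricks, using the stock gluster CLI.
import subprocess

VOLUME = "pfsdata"                       # hypothetical volume name
BRICKS = ["node1:/bricks/btrfs70tb",     # hypothetical host:/brick-path pairs,
          "node2:/bricks/btrfs70tb"]     # one per 70TB btrfs file system

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# replica 2: every file lives on both bricks, so losing one whole raid0
# brick leaves the data readable while the replacement is healed.
run("gluster", "volume", "create", VOLUME, "replica", "2", *BRICKS)
run("gluster", "volume", "start", VOLUME)

# After a failed brick is replaced, the self-heal daemon copies data back;
# progress can be watched with:
run("gluster", "volume", "heal", VOLUME, "info")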

See my email address, which may help explain why GlusterFS is a
non-starter for us. Nevertheless, the idea is a fine one, and we'll have
something similar going on, but at higher raid levels and typically across
a dozen or more such bricks.

Best,

ellis