Re: degraded permanent mount option

Austin S. Hemmelgarn Tue, 30 Jan 2018 08:30:49 -0800

On 2018-01-30 10:09, Tomasz Pala wrote:

On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:

Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.

I think you mean any setup that isn't sensibly layered.


No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.

BCP for over a
decade has been to put multipathing at the bottom, then crypto, then
software RAID, then LVM, and then whatever filesystem you're using.


Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),

It also means you leak no information about your storage stack. If
you're sufficiently worried about data protection that you're using
block-level encryption, you should be thinking _very_ hard about whether
or not that's an acceptable risk (and it usually isn't).


- RAID below LVM means you're stuck with the same RAID-profile for all
   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
   system and RAID0 for various system caches (like ccache on software
   builder machine) or transient LVM-level snapshots.

Then you skip MD and do the RAID work in LVM with DM-RAID (which
technically _is_ MD, just with a different frontend).


- RAID below filesystem means losing btrfs-RAID extra functionality,
   like recovering data from different mirror when CRC mismatch happens,

That depends on your choice of RAID and the exact configuration of the
storage stack. As long as you expose two RAID devices, BTRFS
replication works just fine on top of them.


- crypto below LVM means encrypting everything, including data that is
   not sensitive - more CPU wasted,

Encrypting only sensitive data is never a good idea unless you can prove
with certainty that you will keep it properly segregated, and even then
it's still a pretty bad idea because it makes it obvious exactly where
the information you consider sensitive is stored.


- RAID below LVM means no way to use SSD acceleration of part of the HDD
   space using MD write-mostly functionality.

Again, just use LVM's DM-RAID and throw in DM-cache. Also, there were
some patches just posted for BTRFS that indirectly allow for this
(specifically, they let you change the read-selection algorithm, with
the option of specifying to preferentially read from a specific device).


What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- partitioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.

Multipathing has to be the bottom layer for a given node because it
interacts directly with hardware topology which gets obscured by the
other layers.


It is the bottom layer, but it might be attached into volumes at virtually
any place of the logical topology tree. E.g. bare network drive added as
device-mapper mirror target for on-line volume cloning.

And you seriously think that that's going to be a persistent setup?
One-shot stuff like that is almost never an issue unless your init
system is absolutely brain-dead _and_ you need it working as it was
immediately (and a live-clone of a device doesn't if you're doing it
right).

Crypto essentially has to be next, otherwise you leak
info about the storage stack.


I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.

Again, you're leaking information by doing so. At a minimum, you're
leaking info about where the data you consider sensitive is stored, and
that's not counting volume names (exposed by LVM), container
configuration (possibly exposed depending on how your container stack
handles it), and other storage stack configuration info (exposed by the
metadata of the various layers and possibly by files in /etc if you
don't have your root filesystem encrypted).

Swapping LVM and software RAID ends up
giving you a setup which is difficult for most people to understand and
therefore is hard to reliably maintain.


It's more difficult, as you need to manually maintain two (or more)
separate VGs with matching LVs inside. Harder, but more flexible.

And could also be trivially simplified by eliminating MD and using LVM's
native support for DM-RAID, which provides essentially the exact same
functionality because DM-RAID is largely just a DM frontend for MD.

Other init systems enforce things being this way because it maintains
people's sanity, not because they have significant difficulty doing
things differently (and in fact, it is _trivial_ to change the ordering
in some of them, OpenRC on Gentoo for example quite literally requires
exactly N-1 lines to change in each of N files when re-ordering N
layers), provided each layer occurs exactly once for a given device and
the relative ordering is the same on all devices.  And you know what?


The point is: maintaining all of this logic is NOT a job for the init system.
With systemd you need exactly N-N=0 lines of code to make this work.

So, I find it very hard to believe that systemd requires absolutely zero
configuration of per-device dependencies. If it really doesn't, then
that's just more reason I will never use it, as auto-detection opens you
up to some quite nasty physical attacks on the system.


The appropriate unit files are provided by MD and LVM upstream.
And they include fallback mechanism for degrading volumes.

Given my own experience with systemd, it has exactly the same constraint
on relative ordering.  I've tried to run split setups with LVM and
dm-crypt where one device had dm-crypt as the bottom layer and the other
had it as the top layer, and things locked up during boot on _every_
generalized init system I tried.


Hard to tell without access to the failing system, but this MIGHT have been:

- old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
- old/bugged systemd, possibly with broken/old cryptsetup rules.

It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.

No, they don't (except ZFS).


They don't expose information (as there is none), but they DO mount.

There is no 'state' to expose for anything but BTRFS (and ZFS)


Does ZFS expose its state or not?

Yes, but I'm not quite sure exactly how much. I assume it exposes
enough to check if datasets can be mounted, but it's also not quite the
same situation as BTRFS, because you can start a ZFS volume with half a
pool and selectively mount only those datasets that are completely
provided by the set of devices you do have.

except possibly if the filesystem needs checked or
not.  You're conflating filesystems and volume management.


btrfs is a filesystem, device manager and volume manager.

BTRFS is a filesystem, it does not manage volumes except in the very
limited sense that MD or hardware RAID do, and it does not manage
devices (the kernel and udev do so).

I might add DEVICE to a btrfs-thingy.
I might mount the same btrfs-thingy selecting different VOLUME
(subVOL=something_other)

Except subvolumes aren't really applicable here because they're all or
nothing. If you don't have the base filesystem, you don't have any
subvolumes (because what mounting a subvolume actually does is mount the
root of the filesystem, and then bind-mount the subvolume onto the
specified mount-point).

The alternative way of putting what you just said is:
Every single remaining filesystem manages to mount under systemd without
problems, because it doesn't try to treat them as a block layer.


Or: every other volume manager exposes separate block devices.

Anyway - however we put this into words, it is btrfs that behaves differently.

The 'needless complication', as you named it, usually should be the default
to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
No easy way to RAID the drive (there are device-mapper tricks, they are
just way more complicated). Even attaching SSD cache is not trivial
without preparations (for bcache being the absolutely necessary, much
easier with LVM in place).

For a bog-standard client system, all of those _ARE_ overkill (and
actually, so is BTRFS in many cases too, it's just that we're the only
option for main-line filesystem-level snapshots at the moment).


Such standard systems don't have multidevice btrfs volumes either, so
they are beyond the problem discussed here.

If btrfs pretends to be device manager it should expose more states,


But it doesn't pretend to.


Why mounting sda2 requires sdb2 in my setup then?

First off, it shouldn't unless you're using a profile that doesn't
tolerate any missing devices and have provided the `degraded` mount
option.  It doesn't in your case because you are using systemd.


I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove the
   btrfs-scan tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

mounting btrfs without "btrfs device scan" doesn't work at
all without udev rules (that mimic behaviour of the command).

Actually, try your first mount command above with
`-o device=/dev/sda,device=/dev/sdb` and it will work. You don't need
global scanning or the udev rules unless you want auto-detection. The
thing is, using this mount option (which effectively triggers the scan
code directly on the specified devices as part of the mount call) makes
it work in pretty much all init systems except systemd (which still
tries to check with udev regardless).
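As a rough illustration of the mount option described above, here is a
minimal sketch that assembles the `device=` list for a set of member
devices. The helper name `btrfs_device_opts` is made up for this example,
and the script only prints the resulting mount command rather than
running it (an actual mount needs root and real devices):

```shell
#!/bin/sh
# Hypothetical helper: join member devices into a btrfs "device=" option list.
btrfs_device_opts() {
    opts=""
    for dev in "$@"; do
        # Separate entries with commas, as mount expects for -o.
        if [ -n "$opts" ]; then
            opts="$opts,"
        fi
        opts="${opts}device=$dev"
    done
    printf '%s\n' "$opts"
}

# Dry run only: show the command instead of executing it.
echo "mount -o $(btrfs_device_opts /dev/sda /dev/sdb) /dev/sda /test"
```

Listing every member explicitly like this is what removes the need for a
prior global scan in the scenario discussed above.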

Second, BTRFS is not a volume manager, it's a filesystem with
multi-device support.


What is the designatum difference between 'volume' and 'subvolume'?

This is largely orthogonal to my comment above, but:

A volume is an entirely independent data set. So, the following are all
volumes:

* A partition on a storage device containing a filesystem that needs no
  other devices.
* A device-mapper target exposed by LVM.
* A /dev/md* device exposed by MDADM.
* The internal device mapping used by BTRFS (which is not exposed
  _anywhere_ outside of the given filesystem).
* A ZFS storage pool.

A sub-volume is a BTRFS-specific concept referring to a mostly
independent filesystem tree within a BTRFS volume that still depends on
the super-blocks, chunk-tree, and a couple of other internal structures
from the main filesystem. It's directly equivalent to the ZFS concept
of a dataset, with the caveat that subvolumes are implicitly rooted at
paths within their hierarchy (that is, if you have a subvolume at
/something and mount the root subvolume, you will be able to access the
contents of /something as well from that mount), while ZFS datasets are
not (they have to each be explicitly mounted, and the mount hierarchy
doesn't have to match the actual dataset hierarchy (but almost always
does for sanity reasons)). Furthermore, subvolumes are all-or-nothing
dependent on the state of the filesystem as a whole (in theory, this
could be changed, but it would be so invasive to do so that it's likely
to never happen).

The difference is that it's not a block layer,


As a de facto design choice only.

Not really...

ZFS is really the only comparable design to BTRFS out there, and even
looking at their code it was decidedly non-trivial to implement zvols
and have them play nice with everything else.

despite the fact that systemd is treating it as such.   Yes, BTRFS has
failure modes that result in regular operations being refused based on
what storage devices are present, but so does every single distributed
filesystem in existence, and none of those are volume managers either.


Great example - how is systemd mounting distributed/network filesystems?
Does it mount them blindly, in a loop, or fires some checks against
_plausible_ availability?

Yes, but availability there is a boolean value. In BTRFS it's tri-state
(as of right now, possibly four to six states in the future depending on
what gets merged), and the intermediate (not true or false) state can't
be checked in a trivial manner.
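To make the three states concrete, here is a toy model, not anything
BTRFS or systemd actually implements. The function name and the state
labels are invented for this sketch, and `max_missing` (how many absent
devices the profiles in use can tolerate) is simply assumed to be known,
which is exactly the part that is hard to determine in practice:

```shell
#!/bin/sh
# Toy classification of multi-device availability (hypothetical names).
#   $1 present:     member devices currently visible
#   $2 total:       member devices the filesystem expects
#   $3 max_missing: missing devices the profiles tolerate (assumed known)
btrfs_availability() {
    present=$1 total=$2 max_missing=$3
    missing=$((total - present))
    if [ "$missing" -le 0 ]; then
        echo "complete"      # a normal mount should be possible
    elif [ "$missing" -le "$max_missing" ]; then
        echo "degraded"      # only mountable with the degraded option
    else
        echo "unavailable"   # too many devices missing to mount at all
    fi
}

btrfs_availability 2 2 1
btrfs_availability 1 2 1
btrfs_availability 0 2 1
```

A boolean "ready" check has nowhere to put the middle case, which is the
mismatch being argued about here.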


In other words, is it:
- the systemd that treats btrfs WORSE than distributed filesystems, OR
- btrfs that requires from systemd to be treated BETTER than other fss?

Or maybe it's both? I'm more than willing to admit that what BTRFS does
expose currently is crap in terms of usability. The reason it hasn't
changed is that we (that is, the BTRFS people and the systemd people)
can't agree on what it should look like.

There is a term for such situation: broken by design.

So in other words, it's broken by design to try to connect to a remote
host without pinging it first to see if it's online?


Trying to connect to remote host without checking if OUR network is
already up and if the remote target MIGHT be reachable using OUR routes.

systemd checks LOCAL conditions: being online in case of network, being
online in case of hardware, being online in case of virtual devices.

In all of those cases, there is no advantage to trying to figure out if
what you're trying to do is going to work before doing it, because every


...provided there are some measures taken for the premature operation to be
repeated. There is none in the btrfs ecosystem.

Yes, because we expect the user to do so, just like LVM, and MD, and
pretty much every other block layer you're claiming we should be
behaving like.

There's a name for the type of design you're saying we should have here,
it's called a time of check time of use (TOCTOU) race condition.  It's
one of the easiest types of race conditions to find, and also one of the
easiest to fix.  Ask any sane programmer, and he will say that _that_ is
broken by design.
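For readers unfamiliar with the pattern, here is a generic sketch of the
difference, using a plain file read rather than anything BTRFS-specific.
Both function names are made up for this example:

```shell
#!/bin/sh
# Check-then-use: the file can change or vanish between the test and the
# read, so passing the check guarantees nothing about the read itself.
read_checked() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "not readable: $1" >&2
        return 1
    fi
}

# Just attempt the operation and handle the failure when it happens.
# There is no window between a check and the use, because there is no check.
read_direct() {
    cat "$1" 2>/dev/null || {
        echo "read failed: $1" >&2
        return 1
    }
}
```

The second form is the shape being argued for with mounts: attempt the
operation and report the error, rather than asking beforehand whether it
would succeed.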


Explained before.

And you still blame systemd for using BTRFS_IOC_DEVICES_READY?

Given that it's been proven that it doesn't work and the developers
responsible for its usage don't want to accept that it doesn't work?  Yes.


Remove it then.

As much as I would love to, we can't because <insert usual stable
userspace API rant from Linus and co. here>.

Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.

Or maybe we should just remove it completely, because checking it _IS
WRONG_,


That's right. But before committing upstream, check for consequences.
I've already described a few today, pointed the source and gave some
possible alternate solutions.

which is why no other init system does it, and in fact no


Other init systems either fail at mounting degraded btrfs just like
systemd does, or have buggy workarounds in their code reimplemented in
each other just to handle things that should be centrally organized.

Really? So the fact that I can mount a 2-device volume with RAID1
profiles degraded using OpenRC without needing anything more than adding
rootflags=degraded to the kernel parameters must be a fluke then...

The thing is, it primarily breaks if there are hardware issues,
regardless of the init system being used, but at least the other init
systems _give you an error message_ (even if it's really the kernel
spitting it out) instead of just hanging there forever with no
indication of what's going on like systemd does.

