On 2018-01-30 10:09, Tomasz Pala wrote:
On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:
Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.
I think you mean any setup that isn't sensibly layered.
No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.
BCP for over a
decade has been to put multipathing at the bottom, then crypto, then
software RAID, then LVM, and then whatever filesystem you're using.
Really? Let's enumerate some caveats of this:
- crypto below software RAID means double-encryption (wasted CPU),
It also means you leak no information about your storage stack. If
you're sufficiently worried about data protection that you're using
block-level encryption, you should be thinking _very_ hard about whether
or not that's an acceptable risk (and it usually isn't).
- RAID below LVM means you're stuck with the same RAID-profile for all
the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
system and RAID0 for various system caches (like ccache on software
builder machine) or transient LVM-level snapshots.
Then you skip MD and do the RAID work in LVM with DM-RAID (which
technically _is_ MD, just with a different frontend).
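As a rough sketch of what per-LV RAID profiles could look like with LVM's RAID segment types: the volume group name, LV names, and sizes below are hypothetical, and the function only prints the commands unless RUN is set to an empty string.

```shell
# Hypothetical volume group name; adjust to the real one.
VG=${VG:-vg0}
# Dry run by default: each command is printed, not executed.
# Export RUN="" and run as root to actually create the LVs.
RUN=${RUN-echo}

make_lvs() {
    # Mirrored (RAID1, two copies) LV for the system. Name and size are made up.
    $RUN lvcreate --type raid1 --mirrors 1 --size 20G --name system "$VG"
    # Striped (RAID0, two stripes) LV for throwaway caches such as ccache.
    $RUN lvcreate --type raid0 --stripes 2 --size 50G --name ccache "$VG"
}

make_lvs
```

Each LV gets its own RAID profile inside one VG, which is the flexibility asked for above without a separate MD layer.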
- RAID below filesystem means losing btrfs-RAID extra functionality,
like recovering data from different mirror when CRC mismatch happens,
That depends on your choice of RAID and the exact configuration of the
storage stack. As long as you expose two RAID devices, BTRFS
replication works just fine on top of them.
- crypto below LVM means encrypting everything, including data that is
not sensitive - more CPU wasted,
Encrypting only sensitive data is never a good idea unless you can prove
with certainty that you will keep it properly segregated, and even then
it's still a pretty bad idea because it makes it obvious exactly where
the information you consider sensitive is stored.
- RAID below LVM means no way to use SSD acceleration of part of the HDD
space using MD write-mostly functionality.
Again, just use LVM's DM-RAID and throw in DM-cache. Also, there were
some patches just posted for BTRFS that indirectly allow for this
(specifically, they let you change the read-selection algorithm, with
the option of specifying to preferentially read from a specific device).
What you present is only some sane default, which doesn't mean it covers
all the real-world cases.
My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- partitioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.
Multipathing has to be the bottom layer for a given node because it
interacts directly with hardware topology which gets obscured by the
other layers.
It is the bottom layer, but it might be attached into volumes at virtually
any place of the logical topology tree. E.g. bare network drive added as
device-mapper mirror target for on-line volume cloning.
And you seriously think that that's going to be a persistent setup?
One-shot stuff like that is almost never an issue unless your init
system is absolutely brain-dead _and_ you need it working as it was
immediately (and a live-clone of a device doesn't if you're doing it right).
Crypto essentially has to be next, otherwise you leak
info about the storage stack.
I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.
Again, you're leaking information by doing so. At a minimum, you're
leaking info about where the data you consider sensitive is stored, and
that's not counting volume names (exposed by LVM), container
configuration (possibly exposed depending on how your container stack
handles it), and other storage stack configuration info (exposed by the
metadata of the various layers and possibly by files in /etc if you
don't have your root filesystem encrypted).
Swapping LVM and software RAID ends up
giving you a setup which is difficult for most people to understand and
therefore is hard to reliably maintain.
It's more difficult, as you need to manually maintain two (or more) separate
VGs with matching LVs inside. Harder, but more flexible.
And could also be trivially simplified by eliminating MD and using LVM's
native support for DM-RAID, which provides essentially the exact same
functionality because DM-RAID is largely just a DM frontend for MD.
Other init systems enforce things being this way because it maintains
people's sanity, not because they have significant difficulty doing
things differently (and in fact, it is _trivial_ to change the ordering
in some of them, OpenRC on Gentoo for example quite literally requires
exactly N-1 lines to change in each of N files when re-ordering N
layers), provided each layer occurs exactly once for a given device and
the relative ordering is the same on all devices. And you know what?
The point is: maintaining all of this logic is NOT a job for the init system.
With systemd you need exactly N-N=0 lines of code to make this work.
So, I find it very hard to believe that systemd requires absolutely zero
configuration of per-device dependencies. If it really doesn't, then
that's just more reason I will never use it, as auto-detection opens you
up to some quite nasty physical attacks on the system.
The appropriate unit files are provided by MD and LVM upstream.
And they include fallback mechanism for degrading volumes.
Given my own experience with systemd, it has exactly the same constraint
on relative ordering. I've tried to run split setups with LVM and
dm-crypt where one device had dm-crypt as the bottom layer and the other
had it as the top layer, and things locked up during boot on _every_
generalized init system I tried.
Hard to tell without access to the failing system, but this MIGHT have been:
- old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
- old/bugged systemd, possibly with broken/old cryptsetup rules.
It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.
No, they don't (except ZFS).
They don't expose information (as there is none), but they DO mount.
There is no 'state' to expose for anything but BTRFS (and ZFS)
Does ZFS expose its state or not?
Yes, but I'm not quite sure exactly how much. I assume it exposes
enough to check if datasets can be mounted, but it's also not quite the
same situation as BTRFS, because you can start a ZFS volume with half a
pool and selectively mount only those datasets that are completely
provided by the set of devices you do have.
except possibly if the filesystem needs checked or
not. You're conflating filesystems and volume management.
btrfs is a filesystem, device manager and volume manager.
BTRFS is a filesystem, it does not manage volumes except in the very
limited sense that MD or hardware RAID do, and it does not manage
devices (the kernel and udev do so).
I might add DEVICE to a btrfs-thingy.
I might mount the same btrfs-thingy selecting different VOLUME
(subVOL=something_other)
Except subvolumes aren't really applicable here because they're all or
nothing. If you don't have the base filesystem, you don't have any
subvolumes (because what mounting a subvolume actually does is mount the
root of the filesystem, and then bind-mount the subvolume onto the
specified mount-point).
The alternative way of putting what you just said is:
Every single remaining filesystem manages to mount under systemd without
problems, because it doesn't try to treat them as a block layer.
Or: every other volume manager exposes separate block devices.
Anyway - however we put this into words, it is btrfs that behaves differently.
The 'needless complication', as you named it, usually should be the default
to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
No easy way to RAID the drive (there are device-mapper tricks, they are
just way more complicated). Even attaching SSD cache is not trivial
without preparations (absolutely necessary for bcache, much
easier with LVM in place).
For a bog-standard client system, all of those _ARE_ overkill (and
actually, so is BTRFS in many cases too, it's just that we're the only
option for main-line filesystem-level snapshots at the moment).
Such standard systems don't have multidevice btrfs volumes either, so
they are beyond the problem discussed here.
If btrfs pretends to be a device manager it should expose more states,
But it doesn't pretend to.
Why does mounting sda2 require sdb2 in my setup, then?
First off, it shouldn't, provided you pass the `degraded` mount option
and aren't using a profile that doesn't tolerate any missing devices.
It doesn't work that way in your case because you are using systemd.
I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):
1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan
tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works
mounting btrfs without "btrfs device scan" doesn't work at
all without udev rules (that mimic behaviour of the command).
Actually, try your first mount command above with `-o
device=/dev/sda,device=/dev/sdb` and it will work. You don't need
global scanning or the udev rules unless you want auto-detection. The
thing is, using this mount option (which effectively triggers the scan
code directly on the specified devices as part of the mount call) makes
it work in pretty much all init systems except systemd (which still
tries to check with udev regardless).
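For illustration, the explicit-device mount described above could be wrapped like this. It is a minimal sketch: the function names are made up here, and the device paths are the ones from the example.

```shell
# Build a "device=...,device=..." option string from a list of member devices.
btrfs_device_opts() {
    opts=""
    for dev in "$@"; do
        # Append with a comma separator once opts is non-empty.
        opts="${opts:+$opts,}device=$dev"
    done
    printf '%s\n' "$opts"
}

# Mount a multi-device BTRFS filesystem, naming every member explicitly so
# that no prior "btrfs device scan" or udev rule is needed. Requires root.
mount_btrfs_explicit() {
    mountpoint=$1
    shift
    # "$1" is now the first member device; any member can be the mount source.
    mount -t btrfs -o "$(btrfs_device_opts "$@")" "$1" "$mountpoint"
}

# Usage (as root): mount_btrfs_explicit /test /dev/sda /dev/sdb
```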
Second, BTRFS is not a volume manager, it's a filesystem with
multi-device support.
What is the designatum difference between 'volume' and 'subvolume'?
This is largely orthogonal to my comment above, but:
A volume is an entirely independent data set. So, the following are all
volumes:
* A partition on a storage device containing a filesystem that needs no
other devices.
* A device-mapper target exposed by LVM.
* A /dev/md* device exposed by MDADM.
* The internal device mapping used by BTRFS (which is not exposed
_anywhere_ outside of the given filesystem).
* A ZFS storage pool.
A sub-volume is a BTRFS-specific concept referring to a mostly
independent filesystem tree within a BTRFS volume that still depends on
the super-blocks, chunk-tree, and a couple of other internal structures
from the main filesystem. It's directly equivalent to the ZFS concept
of a dataset, with the caveat that subvolumes are implicitly rooted at
paths within their hierarchy (that is, if you have a subvolume at
/something and mount the root subvolume, you will be able to access the
contents of /something as well from that mount), while ZFS datasets are
not (they have to each be explicitly mounted, and the mount hierarchy
doesn't have to match the actual dataset hierarchy (but almost always
does for sanity reasons)). Furthermore, subvolumes are all-or-nothing
dependent on the state of the filesystem as a whole (in theory, this
could be changed, but it would be so invasive to do so that it's likely
to never happen).
The difference is that it's not a block layer,
As a de facto design choice only.
Not really...
ZFS is really the only comparable design to BTRFS out there, and even
looking at their code it was decidedly non-trivial to implement zvols
and have them play nice with everything else.
despite the fact that systemd is treating it as such. Yes, BTRFS has
failure modes that result in regular operations being refused based on
what storage devices are present, but so does every single distributed
filesystem in existence, and none of those are volume managers either.
Great example - how is systemd mounting distributed/network filesystems?
Does it mount them blindly, in a loop, or fire some checks against
_plausible_ availability?
Yes, but availability there is a boolean value. In BTRFS it's tri-state
(as of right now, possibly four to six states in the future depending on
what gets merged), and the intermediate (not true or false) state can't
be checked in a trivial manner.
In other words, is it:
- the systemd that treats btrfs WORSE than distributed filesystems, OR
- btrfs that requires from systemd to be treated BETTER than other fss?
Or maybe it's both? I'm more than willing to admit that what BTRFS does
expose currently is crap in terms of usability. The reason it hasn't
changed is that we (that is, the BTRFS people and the systemd people)
can't agree on what it should look like.
There is a term for such a situation: broken by design.
So in other words, it's broken by design to try to connect to a remote
host without pinging it first to see if it's online?
Trying to connect to remote host without checking if OUR network is
already up and if the remote target MIGHT be reachable using OUR routes.
systemd checks LOCAL conditions: being online in case of network, being
online in case of hardware, being online in case of virtual devices.
In all of those cases, there is no advantage to trying to figure out if
what you're trying to do is going to work before doing it, because every
...provided there are some measures taken for the premature operation to be
repeated. There are none in the btrfs ecosystem.
Yes, because we expect the user to do so, just like LVM, and MD, and
pretty much every other block layer you're claiming we should be
behaving like.
There's a name for the type of design you're saying we should have here,
it's called a time of check time of use (TOCTOU) race condition. It's
one of the easiest types of race conditions to find, and also one of the
easiest to fix. Ask any sane programmer, and he will say that _that_ is
broken by design.
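A minimal sketch of the alternative, which is to attempt the operation and react to its actual result rather than asking beforehand whether it would succeed. The `retry` helper is a hypothetical name, not something shipped by any init system:

```shell
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. No separate "is it ready?" check is made, so there
# is no window between a check and the use.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Usage (as root): retry 5 2 mount /dev/sda /test
```

The failure of the real operation is the only signal used, so a device that appears or disappears between tries cannot invalidate an earlier answer.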
Explained before.
And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
Given that it's been proven that it doesn't work and the developers
responsible for its usage don't want to accept that it doesn't work? Yes.
Remove it then.
As much as I would love to, we can't because <insert usual stable
userspace API rant from Linus and co. here>.
Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
Or maybe we should just remove it completely, because checking it _IS
WRONG_,
That's right. But before committing upstream, check for consequences.
I've already described a few today, pointed to the source and gave some
possible alternate solutions.
which is why no other init system does it, and in fact no
Other init systems either fail at mounting degraded btrfs just like
systemd does, or have buggy workarounds in their code, reimplemented in
each other, just to handle things that should be centrally organized.
Really? So the fact that I can mount a 2-device volume with RAID1
profiles degraded using OpenRC without needing anything more than adding
rootflags=degraded to the kernel parameters must be a fluke then...
The thing is, it primarily breaks if there are hardware issues,
regardless of the init system being used, but at least the other init
systems _give you an error message_ (even if it's really the kernel
spitting it out) instead of just hanging there forever with no
indication of what's going on like systemd does.