Re: degraded permanent mount option

Tomasz Pala Sat, 27 Jan 2018 16:01:07 -0800

On Sat, Jan 27, 2018 at 13:57:29 -0700, Chris Murphy wrote:

> The Btrfs systemd udev rule is a sledgehammer because it has no
> timeout. It neither times out and tries to mount anyway, nor does it
> time out and just drop to a dracut prompt. There are a number of
> things in systemd startups that have timeouts, I have no idea how they
> get defined, but that single thing would make this a lot better. Right
> now the Btrfs udev rule means if all devices aren't available, hang
> indefinitely.


You are mixing up udev and systemd:
- udev doesn't wait for anything - it REACTS to events. Blame the part
  that doesn't emit the event you want.
- systemd WAITS for the device to appear AND SETTLE. The timeout for
  devices is 90 seconds by default and can be changed in fstab with
  x-systemd.device-timeout.

It cannot "just try-mount this-or-that ON boot" as there is NO state of
"booting" or "shutting down" in the systemd flow, there is only a chain
of events.

Mounting a device happens when it becomes available. If the device is
not crucial for booting up, just add "nofail". If you want to lie about
the device being available, put the appropriate code into the device
handler (btrfs could parse the kernel cmdline and return READY for
degraded).

> this rule can have a timer. Service units absolutely can have timers,
> so maybe there's a way to marry a udev rule with a service which has a
> timer. The absolute dumbest thing that's better than now, is at the
> timer just fail and drop to a dracut prompt. Better would be to try a
> normal mount anyway, which also fails to a dracut prompt, but
> additionally gives us a kernel error for Btrfs (the missing device
> open ctree error you'd expect to get when mounting without -o degraded
> when you're missing a device). And even better would be a way for the
> user to edit the service unit to indicate "upon timeout being reached,
> use mount -o degraded rather than just mount". This is the simplest of
> Boolean logic, so I'd be surprised if systemd doesn't offer a way for
> us to do exactly what I'm describing.

Any fallback to degraded mode requires the volume manager to handle this
gracefully first. Until btrfs is degraded-safe, systemd cannot offer to
mount it degraded in _any_ way.

> Again the central problem is the udev rule now means "wait for device
> to appear" with no timed fallback.

If I have photos on an external drive, is it also the udev rule that
waits for me to plug it in?

You should blame btrfs.ko for not "plugging in" (emitting an event to
udev).

> The mdadm case has this, and it's done by dracut. At this same stage
> of startup with a missing device, there is in fact no fs volume UUID
> yet because the array hasn't started. Dracut+mdadm knows there's a
> missing device so it's just iterating: look, sleep 3, look, sleep 3,
> look, sleep 3. It's on a loop. And after that loop hits something like
> 100, the script says f it, start array anyway, so now there is a

So you need some btrfsd to iterate and eventually say "go degraded",
as systemd isn't the volume manager here.

> degraded array, and for the first time the fs volume UUID appears, and
> systemd goes "ahaha! mount that!" and it does it normally.

You see for yourself: systemd mounts a ready md device. It doesn't
handle the 'ready-or-not-yet' timing-out logic.

> So the timer and timeout and what happens at the timeout is defined by
> dracut. That's probably why the systemd folks say "not our problem"
> and why the kernel folks say "not our problem".

And they are both right - there should be btrfsd for handling this.
Well, except for the kernel cmdline that should be parsed by kernel
guys. But since it is impossible to assemble multidevice btrfs by kernel
itself anyway (i.e. without initrd) this could all go to the daemon.

>> If btrfs pretends to be device manager it should expose more states,
>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> No, mdadm is a device manager and it has no such facility. Something

It has - the ultimate "signalling" in the case of mdadm is the appearance
of the /dev/mdX device. Until that device comes up, systemd obviously
won't mount it.
In the case of btrfs the situation is abnormal - there IS a /dev/sda1
device available, but in fact it might not be available. So there is the
IOCTL to check if the available device is really available. And guess
what - it returns NOT_READY... And you want systemd to mount this
ready-not_ready? You could just as well ask systemd to mount an MD/LVM
device that does not exist.

It is the fault of btrfs that it:
1. reuses the device node for different purposes [*],
2. lacks the timing-out/degraded logic implemented somewhere.

> issues a command to start the array anyway, and only then do you find
> out if there are enough devices to start it. I don't understand the
> value of knowing whether it is possible. Just try to mount it degraded
> and then if it fails we fail, nothing can be done automatically it's
> up to an admin.

It can't mount degraded, because the "missing" device might come online
a few seconds later.

> And even if you had this "degraded mount possible" state, you still
> need a timer. So just build the timer.

Exactly! This "timer" is a btrfs-specific daemon that should be shipped
with btrfs-tools. Well, maybe not an actual daemon, as btrfs handles
incremental assembly on its own, just the appropriate units and
signalling.

For mdadm, --incremental is used for gradual assembly via a udev rule:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/udev-md-raid-assembly.rules
(this also fires timer)

and the systemd part for timing-out and degraded fallback:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd
mdadm-last-resort@.timer
mdadm-last-resort@.service

There is appropriate code in LVM as well, using lvmetad, but this one is easier.

So, let's go through your proposal step by step:

> If all devices ready ioctl is true, the timer doesn't start, it means
> all devices are available, mount normally.

sure

> If all devices ready ioctl is false, the timer starts, if all devices
> appear later the ioctl goes to true, the timer is belayed, mount
> normally.

sure

> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount normally which fails and gives us a shell to
> troubleshoot at.
> OR
> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount with -o degraded which either succeeds and we boot or
> it fails and we have a troubleshooting shell.

Don't mix layers - just imagine your /dev/sda1 is not there and you
simply cannot even try to mount it; this should be done like this:

If the all-devices-ready ioctl is false, the timer starts; when the timer
times out, TELL THE KERNEL THAT WE WANT A DEGRADED MOUNT. This in turn
should switch the IOCTL response to "OK, go degraded", which in turn
would make the udev rule raise the flag[*], and then systemd could mount
this.

It is important that the kernel is the layer being instructed, rather
than the last step (the mount itself), as this leaves room to pass the
degraded option using the cmdline.
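
The flow above can be sketched in a few lines of Python. This is only an
illustration of the layering, not an existing tool: wait_for_volume,
all_devices_ready and request_degraded are hypothetical names, with
all_devices_ready standing in for the ready IOCTL and request_degraded
standing in for "tell the kernel that we want a degraded mount". The 90 s
timeout and the 3 s poll interval are assumptions borrowed from the
systemd default and the dracut loop quoted earlier.

```python
import time
from typing import Callable


def wait_for_volume(
    all_devices_ready: Callable[[], bool],   # hypothetical: wraps the ready IOCTL
    request_degraded: Callable[[], bool],    # hypothetical: asks the kernel to go degraded
    timeout_s: float = 90.0,                 # assumed, mirrors the systemd device default
    poll_s: float = 3.0,                     # assumed, mirrors the dracut "sleep 3" loop
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'ready', 'degraded' or 'failed'. It never mounts anything."""
    deadline = clock() + timeout_s
    while True:
        if all_devices_ready():
            # All members showed up: the timer is cancelled, normal mount follows.
            return "ready"
        if clock() >= deadline:
            break
        sleep(poll_s)
    # Timer expired: instruct the lower layer instead of trying to mount.
    if request_degraded():
        return "degraded"
    return "failed"
```

Only the "ready" and "degraded" results would correspond to the udev flag
being raised; the mount itself stays with systemd, which is the whole
point of not mixing layers.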

> The central problem is the lack of a timer and time out.

You have mdadm-last-resort@.timer/service above. If btrfs doesn't lack
anything, as you all state here, it should be easy to make this work.
Go ahead please.

>> Unless there is *some* signalling from btrfs, there is really not much
>> systemd can *safely* do.
> 
> That is not true. It's not how mdadm works anyway.

Yes it does. You can't mount mdadm until /dev/mdX appears, which happens
when the array gets fully assembled *OR* times out and the kernel gets
instructed to run the array as degraded, which results in /dev/mdX
appearing. There is NO additional logic in systemd.

It is NOT systemd that assembles the degraded mdadm array; it is mdadm
that tells the kernel to assemble it, and systemd mounts the READY md.
Moreover, systemd gives you a set of tools that you can use for timers.

[*] the udev flag is required to distinguish the /dev/sda1 block device
from the /dev/sda1 btrfs-volume device being ready. If a separate device
were created, there would be no need for this entire IOCTL.

-- 
Tomasz Pala <go...@pld-linux.org>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
