Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Dec 30, 2021 at 3:59 PM Chris Murphy wrote:
>
> ZFS uses volume and user properties which we could probably mimic with
> xattr. I thought I asked about xattr instead of subvolume names at one
> point in the thread but I don't see it. So instead of using subvolume
> names, what about stuffing this information in xattr? My gut instinct
> is this is less transparent and user friendly, and it requires more
> tools to know how to use to troubleshoot and fix, etc.

Separate from whether the obscurity of an xattr is a good idea or not, read-only snapshots can't have xattrs added, removed, or modified. We can rename read-only snapshots, however. While we could unset the ro property, that also wipes the received UUID used by send/receive. And while we could make an rw snapshot of the ro snapshot, modify the xattr, then make an ro snapshot of the rw snapshot, this alters the parent UUID, also used by send/receive. So it complicates send/receive workflows as a potential update mechanism, or for backup/restore, for anything that tracks these UUIDs, e.g. btrbk.

--
Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
(I'm sorta not doing a great job of using "sub-volume" to mean generically any of Btrfs subvolume or a directory or a logical volume, so hopefully anyone still following can make the leap that I don't intend this spec to be Btrfs specific. I like it being general purpose.)

On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

ZFS uses volume and user properties which we could probably mimic with xattr. I thought I asked about xattr instead of subvolume names at one point in the thread but I don't see it. So instead of using subvolume names, what about stuffing this information in xattr? My gut instinct is this is less transparent and user friendly, and it requires more tools to know how to use to troubleshoot and fix, etc.

--
Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel wrote:
>
> Chris Murphy wrote:
> > The part I'm having a hard time separating is the implicit case (use
> > some logic to assemble the correct objects), versus explicit (the
> > bootloader snippet points to a root and the root contains an fstab -
> > nothing about assembly is assumed). And should both paradigms exist
> > concurrently in an installed system, and how to deconflict?
>
> Not sure there is a conflict. The discovery logic is well defined after
> all. Also I assume normal operation wouldn't mix the two. Package
> management or whatever installs updates would automatically do the right
> thing suitable for the system at hand.

rootflags=subvol=/subvolid= should override the discoverable sub-volumes generator. I don't expect rootflags is normally used in a discoverable sub-volumes workflow, but if the user were to add it for some reason, we'd want it to be favored.

> > Further, (open)SUSE tends to define the root to boot via `btrfs
> > subvolume set-default` which is information in the file system itself,
> > neither in the bootloader snippet nor in the naming convention. It's
> > neat, but also not discoverable. If users are trying to
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

I think the transactional model can accommodate this better anyway, and it's the direction I'd like to go in with Fedora. Make updates/upgrades happen out of band (in a container on a snapshot). We can apply resource control limits so that the upgrade process doesn't negatively impact the user's higher priority workload. If the update fails to complete or fails a set of simple tests, the snapshot is simply discarded. No harm done to the running system.
If it passes checks, then its name is changed to indicate it's the favored "next root" following reboot. And we don't have to keep a database to snapshot, assemble, and discard things; it can all be done by the naming scheme. I think the naming scheme should include some sort of "in-progress" tag, so it's discoverable that such a sub-volume is (a) not active, (b) in some state of flux that potentially was interrupted, and (c) isn't critical to the system. Such a sub-volume should either be destroyed (failed update) or renamed (update succeeds). If the owning process were to fail (crash, power failure), the next time it runs to check for updates, it would discover this "in-progress" sub-volume and remove it (assume it's in a failed state).

--
Chris Murphy
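[Editor's sketch] The update workflow described above — work under an "in-progress" name, rename on success, discard on failure, sweep stale leftovers at startup — can be sketched as pure name manipulation. The `.in-progress` tag and the helper names below are hypothetical, not part of any spec, and plain directories stand in for btrfs snapshots:

```python
import os

IN_PROGRESS = ".in-progress"  # hypothetical tag; a spec would define the real one

def begin_update(topdir, name):
    """Create the working sub-volume under an in-progress name.

    A real implementation would run 'btrfs subvolume snapshot' here; a
    plain directory stands in for the snapshot in this sketch."""
    work = os.path.join(topdir, name + IN_PROGRESS)
    os.mkdir(work)
    return work

def finish_update(work, ok):
    """On success, rename to the final name so it becomes discoverable
    as the favored next root; on failure, discard it."""
    if ok:
        final = work[: -len(IN_PROGRESS)]
        os.rename(work, final)
        return final
    os.rmdir(work)  # real code: 'btrfs subvolume delete'
    return None

def sweep_stale(topdir):
    """Cleanup pass run by the updater at startup: anything still tagged
    in-progress was interrupted and is assumed failed."""
    removed = []
    for entry in sorted(os.listdir(topdir)):
        if entry.endswith(IN_PROGRESS):
            os.rmdir(os.path.join(topdir, entry))
            removed.append(entry)
    return removed
```

The point of the sketch is that no database is needed: the tag in the name alone distinguishes a candidate from an active root.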
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Chris Murphy wrote:
> On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel wrote:
>> Lennart Poettering wrote:
>>> Or to say this explicitly: we could define the spec to say that if
>>> we encounter:
>>>
>>>    /@auto/root-x86-64:fedora_36.0+3-0
>>>
>>> on first boot attempt we'd rename it:
>>>
>>>    /@auto/root-x86-64:fedora_36.0+2-1
>>>
>>> and so on. Until boot succeeds in which case we'd rename it:
>>>
>>>    /@auto/root-x86-64:fedora_36.0
>>>
>>> i.e. we'd drop the counting suffix.
>>
>> Thanks for the explanation and pointer!
>>
>> Need to think aloud a bit :-)
>>
>> That method basically works for systems with read-only root. Ie where
>> the next OS to boot is in a separate snapshot, eg MicroOS.
>> A traditional system with rw / on btrfs would stay on the same subvolume
>> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
>> openSUSE package installation automatically leads to ro snapshot
>> creation. In order to fit in I suppose those could then be named eg.
>> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
>> subvolume would never be booted.
>
> Yeah the N+0 subvolumes could be read-only snapshots, their purpose is
> only to be used as an immutable checkpoint from which to produce
> derivatives, read-write subvolumes. But what about the case of being
> in a preboot environment, and have no way (yet) to rename or create a
> new snapshot to boot, and you need to boot one of these read-only
> snapshots? What if the bootloader was smart enough to add the proper
> volatile overlay arrangement anytime an N+0 subvolume is chosen for
> boot? Is that plausible and useful?

The initrd would have to make those arrangements. AFAICT so far openSUSE systems just boot into such a RO environment without any preparations. So fully read-only, just enough to run snapper to create a usable snapshot again.

>> Anyway, let's assume the ro case and both efi partition and btrfs volume
>> use this scheme.
>> That means each time some packages are updated we get a
>> new subvolume. After reboot the initrd in the efi partition would try to
>> boot that new subvolume. If it reaches systemd-bless-boot.service the
>> new subvolume becomes the default for the future.
>>
>> So far so good. What if I discover later that something went wrong
>> though? Some convenience tooling to mark the current version bad again
>> would be needed.
>>
>> But then having Tumbleweed in mind it needs some capability to boot any
>> old snapshot anyway. I guess the solution here would be to just always
>> generate a bootloader entry, independent of whether a kernel was
>> included in an update. Each entry would then have to specify kernel,
>> initrd and the root subvolume to use.
>
> The part I'm having a hard time separating is the implicit case (use
> some logic to assemble the correct objects), versus explicit (the
> bootloader snippet points to a root and the root contains an fstab -
> nothing about assembly is assumed). And should both paradigms exist
> concurrently in an installed system, and how to deconflict?

Not sure there is a conflict. The discovery logic is well defined after all. Also I assume normal operation wouldn't mix the two. Package management or whatever installs updates would automatically do the right thing suitable for the system at hand.

> Further, (open)SUSE tends to define the root to boot via `btrfs
> subvolume set-default` which is information in the file system itself,
> neither in the bootloader snippet nor in the naming convention. It's
> neat, but also not discoverable. If users are trying to

The way btrfs is used in openSUSE is based on systems from ten years ago. A lot has changed since then. Now with the idea to have /usr on a separate read-only subvolume the current model doesn't really work very well anymore IMO.
So I think there's a window of opportunity to change the way openSUSE does things :-)

cu
Ludwig

--
 (o_   Ludwig Nussel
 //\
 V_/_  http://www.suse.com/

SUSE Software Solutions Germany GmbH, GF: Ivo Totev
HRB 36809 (AG Nürnberg)
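[Editor's sketch] The counting-suffix scheme Lennart describes in the quoted message ("+tries-left-tries-done", decremented on each boot attempt and dropped entirely once boot succeeds) is pure name manipulation, and can be sketched like this; the helper names are illustrative, not taken from any implementation:

```python
import re

# Matches the "+<tries-left>-<tries-done>" counter suffix from the
# boot-counting scheme, e.g. "root-x86-64:fedora_36.0+3-0".
_COUNTER = re.compile(r"^(?P<stem>.*)\+(?P<left>\d+)-(?P<done>\d+)$")

def record_boot_attempt(name):
    """Return the name after one boot attempt: tries-left goes down,
    tries-done goes up. Names without a counter are returned unchanged,
    as are names with no tries left (such an entry is considered bad)."""
    m = _COUNTER.match(name)
    if not m or int(m["left"]) == 0:
        return name
    return "%s+%d-%d" % (m["stem"], int(m["left"]) - 1, int(m["done"]) + 1)

def mark_good(name):
    """Drop the counter suffix entirely once a boot succeeds."""
    m = _COUNTER.match(name)
    return m["stem"] if m else name
```

So `root-x86-64:fedora_36.0+3-0` becomes `root-x86-64:fedora_36.0+2-1` after one attempt, and plain `root-x86-64:fedora_36.0` once blessed.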
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Fr, 10.12.21 12:25, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering wrote:
>
> > That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> > wrapper to make things self-descriptive.
>
> Do you mean the image file contains a GPT, and the squashfs is a
> partition within the image? Does this recommendation apply to any
> image? Let's say it's a Btrfs image. And in the context of this
> thread, the GPT partition type GUID would be the "super-root" GUID?

Yes, I'd always add a GPT wrapper around disk images. It's simple, extensible and first and foremost self-descriptive: you know what you are looking at, safely, before parsing the fs. It opens the door for adding verity data in a very natural way, and more.

Lennart

--
Lennart Poettering, Berlin
[systemd-devel] Antw: Re: Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification
>>> Chris Murphy schrieb am 10.12.2021 um 16:59 in Nachricht:
> On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl wrote:
>>
>> >>> Lennart Poettering schrieb am 19.11.2021 um 10:17 in Nachricht:
>> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
>> >
>> >> How to do swapfiles?
>> >
>> > Is this really a concept that deserves too much attention? I mean, I
>> > have the suspicion that half the benefit of swap space is that it can
>> > act as backing store for hibernation. But swap files are icky for that
>> > since that means the resume code has to mount the fs first, but given
>> > the fs is dirty during the hibernation state this is highly problematic.
>> >
>> > Hence, I have the suspicion that if you do swap you should probably do
>> > swap partitions, not swap files, because it can cover all usecase:
>> > paging *and* hibernation.
>>
>> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?
>
> I don't think that's supported.
> https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html

AFAIK Redhat-based Qubes OS uses it; that's where I first saw it.

> --
> Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 4, 2021 at 9:39 AM Lennart Poettering wrote:
> 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
>    <type>[:<version>]. The type should have a similar vocabulary
>    as the GPT spec type UUIDs, but probably use textual identifiers
>    rather than UUIDs, simply because naming dirs by uuids is
>    weird. Examples:
>
>    /@auto/root-x86-64:fedora_36.0/
>    /@auto/root-x86-64:fedora_36.1/
>    /@auto/root-x86-64:fedora_37.1/
>    /@auto/home/
>    /@auto/srv/
>    /@auto/tmp/
>
>    Which would be assembled by the initrd into the following via bind
>    mounts:
>
>    /         → /@auto/root-x86-64:fedora_37.1/
>    /home/    → /@auto/home/
>    /srv/     → /@auto/srv/
>    /var/tmp/ → /@auto/tmp/

What about arbitrary mountpoints and their subvolumes? Things we can't predict in advance for all use cases? For example, for my non-ephemeral systems:

* /var/log is a directory contained in subvolume "varlog-x86-64:fedora.35"
* /var/lib/libvirt/images is a directory contained in subvolume "varlibvirtimages-x86-64:fedora.35"
* /var/lib/flatpak is a directory contained in a subvolume "varlibflatpak-x86-64:any" - as it isn't Fedora specific and uses its own versioning, in this case I'd expect it gets mounted with any distribution.

These exist so they are excluded from a snapshot and rollback regime that applies to "root-x86-64:fedora.35", which contains usr/ var/ etc/. A rollback of root does not roll back the systemd journal, VM images, or flatpaks.

Is space a valid separator in the name of the subvolume? Or underscore? This would become / to define the path to the mount point.

Additionally, I'm noticing that none of 'journalctl -o verbose' or json or export shows what subvolume was mounted at each mount point. I need to use systemd debug for this information to be included in the journal. Assembly of versioned roots is probably useful logging information by default. e.g.

Dec 10 10:45:00 fovo.local systemd[1]: Mounting '@auto/root-x86-64:fedora.35' at /sysroot...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/home' at /home...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibflatpak' at /var/lib/flatpak...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibvirtimages-x86-64:fedora.35' at /var/lib/libvirt/images...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlog-x86-64:fedora.35' at /var/log...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/swap' at /var/swap...

--
Chris Murphy
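[Editor's sketch] The assembly step quoted above amounts to grouping /@auto/ entries by type, picking the newest version of each, and bind-mounting each winner to a fixed target per type. A rough sketch of that selection logic follows; the type-to-mountpoint table is copied from the quoted example, and plain string comparison stands in for whatever version ordering a spec would actually define:

```python
# Where each recognized type gets bind-mounted; from the quoted example,
# not an authoritative table.
TARGETS = {"root-x86-64": "/", "home": "/home", "srv": "/srv", "tmp": "/var/tmp"}

def split_name(entry):
    """Split '<type>[:<version>]' into (type, version-or-None)."""
    type_, _, version = entry.partition(":")
    return type_, version or None

def assemble(entries):
    """Pick the newest version per recognized type and map it to its
    mount target. Version ordering is plain string comparison here,
    purely for illustration."""
    newest = {}
    for entry in entries:
        type_, version = split_name(entry)
        if type_ not in TARGETS:
            continue
        cur = newest.get(type_)
        if cur is None or (version or "") > (split_name(cur)[1] or ""):
            newest[type_] = entry
    return {TARGETS[t]: "/@auto/" + e for t, e in newest.items()}
```

Chris's question stands unanswered by this sketch: a fixed table like `TARGETS` is exactly what breaks down for arbitrary, unpredictable mount points.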
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering wrote:

> That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> wrapper to make things self-descriptive.

Do you mean the image file contains a GPT, and the squashfs is a partition within the image? Does this recommendation apply to any image? Let's say it's a Btrfs image. And in the context of this thread, the GPT partition type GUID would be the "super-root" GUID?

--
Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel wrote:
>
> Lennart Poettering wrote:
> > Or to say this explicitly: we could define the spec to say that if
> > we encounter:
> >
> >    /@auto/root-x86-64:fedora_36.0+3-0
> >
> > on first boot attempt we'd rename it:
> >
> >    /@auto/root-x86-64:fedora_36.0+2-1
> >
> > and so on. Until boot succeeds in which case we'd rename it:
> >
> >    /@auto/root-x86-64:fedora_36.0
> >
> > i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.

Yeah the N+0 subvolumes could be read-only snapshots; their purpose is only to be used as an immutable checkpoint from which to produce derivatives, read-write subvolumes. But what about the case of being in a preboot environment, with no way (yet) to rename or create a new snapshot to boot, and you need to boot one of these read-only snapshots? What if the bootloader was smart enough to add the proper volatile overlay arrangement anytime an N+0 subvolume is chosen for boot? Is that plausible and useful?

> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though?
> Some convenience tooling to mark the current version bad again
> would be needed.
>
> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.

The part I'm having a hard time separating is the implicit case (use some logic to assemble the correct objects), versus explicit (the bootloader snippet points to a root and the root contains an fstab - nothing about assembly is assumed). And should both paradigms exist concurrently in an installed system, and how to deconflict?

Further, (open)SUSE tends to define the root to boot via `btrfs subvolume set-default` which is information in the file system itself, neither in the bootloader snippet nor in the naming convention. It's neat, but also not discoverable. If users are trying to learn, understand, and troubleshoot how systems boot and assemble themselves, to what degree are they owed transparency without needing extra tools or decoder rings to reveal settings?

The default subvolume is uniquely btrfs, and without an equivalent anywhere else (so far as I'm aware) I'm reluctant to use it for day to day boots. I can see the advantage of this for btrfs for some sort of rescue/emergency boot subvolume, however... where the entry doesn't contain the parameter "rootflags=subvol=$root" (which acts as an override for the default subvolume set in the fs itself), the btrfs default subvolume would be used. I'm struggling with its role in all of this though.

--
Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Fri, Nov 19, 2021 at 4:17 AM Lennart Poettering wrote:
>
> On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > How to do swapfiles?
>
> Is this really a concept that deserves too much attention?

*shrug* Only insofar as I like order, and like the idea of agreeing on where things belong if they're going to appear somewhere.

> I mean, I
> have the suspicion that half the benefit of swap space is that it can
> act as backing store for hibernation.

Yes, and that's a terrible conflation. The swapfile/device is for anonymous pages. Hibernation images are not anon pages, and even have special rules like needing to be contained in contiguous physical device blocks. It may turn out that 'swsusp' (Swap Suspend) in the kernel shouldn't be deprecated, and instead future effort should focus on 'uswsusp'. But discussions around signed and authenticated hibernation images for UEFI Secure Boot and kernel lockdown compatibility have all been around the kernel implementation.

https://www.kernel.org/doc/Documentation/power/swsusp.rst
https://www.kernel.org/doc/Documentation/power/userland-swsusp.rst

> But swap files are icky for that
> since that means the resume code has to mount the fs first, but given
> the fs is dirty during the hibernation state this is highly problematic.

It's sufficiently complicated and non-fail-safe (it fails dangerously) that it's broken. On btrfs, it's more tedious but less broken because you must use both

resume=UUID=$uuid resume_offset=$physicaloffsethibernationimage

In effect the kernel does not need to mount ro the btrfs file system at all; it gets the hint for the physical location of the hibernation image from kernel boot parameters. Other file systems support discovery of the physical offset once the file system is mounted ro.

On Btrfs you can see the swapfile as having a punch-through mechanism. It's a reservation of blocks, and page-outs happen directly to that reservation of blocks, not via the file system itself.
This is why there are all these limitations: balance doesn't touch block groups containing any swapfile blocks, you can't do any kind of multiple device stuff, you can't snapshot/reflink the swapfile, etc.

Which is why I'm in favor of just ceding this entire territory over to systemd to manage correctly. But as a prerequisite, the hibernation image should be separate from the swapfile. And it should have a metadata format so we can pair file system state to hibernation image state, so that for sure we aren't running into catastrophic nonsense like this right at the top of https://www.kernel.org/doc/Documentation/power/swsusp.rst

**BIG FAT WARNING**
If you touch anything on disk between suspend and resume...
...kiss your data goodbye.
If you do resume from initrd after your filesystems are mounted...
...bye bye root partition.

Horrible.

> Hence, I have the suspicion that if you do swap you should probably do
> swap partitions, not swap files, because it can cover all usecase:
> paging *and* hibernation.

I agree only insofar as it's the most reliable thing we have right now. Not that it's an efficient or safe design: you can still have problems if you rw mount a file system and then resume from a hibernation image. The kernel has no concept of matching a file system state to that of a hibernation image, so that the hibernation image can be invalidated, thus avoiding subsequent corruption.

> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e.
> > not arch or
> > distro specific
>
> I'd still put it one level down, and mark it with some non-typical
> character so that it is less likely to clash with anything else.

I'm not sure I understand "one level down". The "swap" subvolume would be in the top-level of the Btrfs file system, just like Fedora's existing "root" and "home" subvolumes are in the top level.

> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.

OK, super.

> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic
> activation of swap files is desirable, then there should probably be a
> systemd generator that finds all suitable files in /var/swap/ and
> generates .swap units for them. This would then work with any kind of
> setup, i.e. independently of the btrfs auto-discovery stuff. The other
> thing would be the btrfs auto-discovery to then actually mount
> something there automatically.

I think it's desirable only because users
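[Editor's note] Pulling the pieces of this exchange together, the arrangement being described is an fstab pair plus two kernel command line parameters. `$FSUUID` and `$PHYSICAL_OFFSET` are placeholders; on btrfs the physical offset of the hibernation image has to be determined out of band, rather than by mounting the file system:

```
# /etc/fstab: mount the "swap" subvolume, then swap on the file in it
UUID=$FSUUID         /var/swap  btrfs  noatime,subvol=swap  0 0
/var/swap/swapfile1  none       swap   defaults             0 0

# kernel command line, for resume from a swapfile on btrfs
resume=UUID=$FSUUID resume_offset=$PHYSICAL_OFFSET
```

As Chris notes above, with both parameters present the kernel never needs to mount the btrfs volume read-only to locate the image.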
Re: [systemd-devel] Antw: [EXT] Re: [systemd‑devel] the need for a discoverable sub‑volumes specification
On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl wrote:
>
> >>> Lennart Poettering schrieb am 19.11.2021 um 10:17 in Nachricht:
> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
> >
> > > How to do swapfiles?
> >
> > Is this really a concept that deserves too much attention? I mean, I
> > have the suspicion that half the benefit of swap space is that it can
> > act as backing store for hibernation. But swap files are icky for that
> > since that means the resume code has to mount the fs first, but given
> > the fs is dirty during the hibernation state this is highly problematic.
> >
> > Hence, I have the suspicion that if you do swap you should probably do
> > swap partitions, not swap files, because it can cover all usecase:
> > paging *and* hibernation.
>
> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

I don't think that's supported.
https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html

--
Chris Murphy
[systemd-devel] Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification
>>> Lennart Poettering schrieb am 19.11.2021 um 10:17 in Nachricht:
> On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > How to do swapfiles?
>
> Is this really a concept that deserves too much attention? I mean, I
> have the suspicion that half the benefit of swap space is that it can
> act as backing store for hibernation. But swap files are icky for that
> since that means the resume code has to mount the fs first, but given
> the fs is dirty during the hibernation state this is highly problematic.
>
> Hence, I have the suspicion that if you do swap you should probably do
> swap partitions, not swap files, because it can cover all usecase:
> paging *and* hibernation.

Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
>
> I'd still put it one level down, and mark it with some non-typical
> character so that it is less likely to clash with anything else.
>
> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.
>
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic
> activation of swap files is desirable, then there should probably be a
> systemd generator that finds all suitable files in /var/swap/ and
> generates .swap units for them. This would then work with any kind of
> setup, i.e. independently of the btrfs auto-discovery stuff.
> The other
> thing would be the btrfs auto-discovery to then actually mount
> something there automatically.
>
> > Also, instead of /@auto/ I'm wondering if we could have
> > /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
> > while I'm a big fan of the @ symbol for typographic history reasons,
> > it's being used in the subvolume/snapshot regimes rather haphazardly
> > for different purposes which might be confusing? e.g. Timeshift
> > expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
> > uses @ for its (visible) root subvolume in which everything else goes.
> > And still ZFS uses @ for their (read-only) snapshots.
>
> I try to keep the "systemd" name out of entirely generic specs, since
> there are some people who have an issue with that. I.e. this way we
> tricked even Devuan into adopting /etc/os-release and the /run/
> hierarchy, since they probably aren't even aware that these are systemd
> things.
>
> Other chars could be used too: /+auto/ sounds OK to me too. Or
> /_auto/, or /=auto/ or so.
>
> Lennart
>
> --
> Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 18.11.21 15:01, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy wrote:
> >
> > How to do swapfiles?
> >
> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
> > b. Is the mount point, /var/swap, OK?
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> Actually I'm thinking of something different suddenly... because
> without user ownership of swapfiles, and instead systemd having domain
> over this, it's perhaps more like:
>
> /x-systemd.auto/swap -> /run/systemd/swap

I'd be conservative with mounting disk stuff to /run/. We do this for removable disks because the mount points are kinda dynamic, hence it makes sense, but for this case it sounds unnecessary. /var/swap sounds fine to me, in particular as the /var/ partition actually sounds like the right place for it if /var/swap/ is not a mount point in itself but just a plain subdir.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:

> How to do swapfiles?

Is this really a concept that deserves too much attention? I mean, I have the suspicion that half the benefit of swap space is that it can act as backing store for hibernation. But swap files are icky for that since that means the resume code has to mount the fs first, but given the fs is dirty during the hibernation state this is highly problematic.

Hence, I have the suspicion that if you do swap you should probably do swap partitions, not swap files, because it can cover all use cases: paging *and* hibernation.

> Currently I'm creating a "swap" subvolume in the top-level of the file
> system and /etc/fstab looks like this
>
> UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it
> can just be "swap" because the GPT method is just a single partition
> type GUID that's shared by multiboot Linux setups, i.e. not arch or
> distro specific

I'd still put it one level down, and mark it with some non-typical character so that it is less likely to clash with anything else.

> b. Is the mount point, /var/swap, OK?

I see no reason why not.

> c. What should the additional naming convention be for the swapfile
> itself so swapon happens automatically?

To me it appears these things should be distinct: if automatic activation of swap files is desirable, then there should probably be a systemd generator that finds all suitable files in /var/swap/ and generates .swap units for them. This would then work with any kind of setup, i.e. independently of the btrfs auto-discovery stuff. The other thing would be the btrfs auto-discovery to then actually mount something there automatically.

> Also, instead of /@auto/ I'm wondering if we could have
> /x-systemd.auto/ ?
> This makes it more clearly systemd's namespace, and
> while I'm a big fan of the @ symbol for typographic history reasons,
> it's being used in the subvolume/snapshot regimes rather haphazardly
> for different purposes which might be confusing? e.g. Timeshift
> expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
> uses @ for its (visible) root subvolume in which everything else goes.
> And still ZFS uses @ for their (read-only) snapshots.

I try to keep the "systemd" name out of entirely generic specs, since there are some people who have an issue with that. I.e. this way we tricked even Devuan into adopting /etc/os-release and the /run/ hierarchy, since they probably aren't even aware that these are systemd things.

Other chars could be used too: /+auto/ sounds OK to me too. Or /_auto/, or /=auto/ or so.

Lennart

--
Lennart Poettering, Berlin
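[Editor's sketch] A generator along the lines Lennart sketches would scan /var/swap/ and write a .swap unit per file it finds. The sketch below takes an output directory as a parameter and uses a simplified form of systemd's unit-name escaping (only '/' → '-'; the real rules in systemd.unit(5) also escape '-' and non-ASCII characters within path components), so treat the names and layout as illustrative:

```python
import os

def unit_name(path):
    """Systemd-style unit name for a path, simplified: strip the leading
    slash, turn the remaining slashes into dashes, append '.swap'."""
    return path.strip("/").replace("/", "-") + ".swap"

def generate(swap_dir, out_dir):
    """Write a .swap unit into out_dir for each regular file in swap_dir.

    A real generator would also validate the files (size, permissions,
    NOCOW on btrfs) before emitting a unit."""
    written = []
    for name in sorted(os.listdir(swap_dir)):
        path = os.path.join(swap_dir, name)
        if not os.path.isfile(path):
            continue
        unit = unit_name(path)
        with open(os.path.join(out_dir, unit), "w") as f:
            f.write("[Unit]\n"
                    "Description=Swap file %s\n\n"
                    "[Swap]\n"
                    "What=%s\n" % (path, path))
        written.append(unit)
    return written
```

Run against /var/swap/, this would emit e.g. `var-swap-swapfile1.swap`, which systemd then activates like any other .swap unit, independently of the btrfs auto-discovery.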
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy wrote:
>
> How to do swapfiles?
>
> Currently I'm creating a "swap" subvolume in the top-level of the file
> system and /etc/fstab looks like this
>
> UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it
> can just be "swap" because the GPT method is just a single partition
> type GUID that's shared by multiboot Linux setups, i.e. not arch or
> distro specific
> b. Is the mount point, /var/swap, OK?
> c. What should the additional naming convention be for the swapfile
> itself so swapon happens automatically?

Actually I'm thinking of something different suddenly... because without user ownership of swapfiles, and instead systemd having domain over this, it's perhaps more like:

/x-systemd.auto/swap -> /run/systemd/swap

And then systemd just manages the files in that directory per policy, e.g. do on-demand creation of swapfiles with variable size increments, as well as cleanup.

--
Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
How to do swapfiles? Currently I'm creating a "swap" subvolume in the top-level of the file system and /etc/fstab looks like this: UUID=$FSUUID /var/swap btrfs noatime,subvol=swap 0 0 /var/swap/swapfile1 none swap defaults 0 0 This seems to work reliably after hundreds of boots. a. Is this naming convention for the subvolume adequate? Seems like it can just be "swap" because the GPT method is just a single partition type GUID that's shared by multiboot Linux setups, i.e. not arch or distro specific b. Is the mount point, /var/swap, OK? c. What should the additional naming convention be for the swapfile itself so swapon happens automatically? Also, instead of /@auto/ I'm wondering if we could have /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and while I'm a big fan of the @ symbol for typographic history reasons, it's being used in the subvolume/snapshot regimes rather haphazardly for different purposes which might be confusing? e.g. Timeshift expects subvolumes it manages to be prefixed with @. Meanwhile SUSE uses @ for its (visible) root subvolume in which everything else goes. And still ZFS uses @ for their (read-only) snapshots. -- Chris Murphy
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On 11.11.2021 19.27, Lennart Poettering wrote:
> On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote:
>>> Doing this RootDirectory= would make a ton of sense too I guess, but it's not as obvious there: we'd need to extend the setting a bit I think to explicitly enable this logic. As opposed to the RootImage= case (where the logic should be default on) I think any such logic for RootDirectory= should be opt-in for security reasons because we cannot safely detect environments where this logic is desirable and discern them from those where it isn't. In RootImage= we can bind this to the right GPT partition type being used to mark root file systems that are arranged for this kind of setup. But in RootDirectory= we have no concept like that and the stuff inside the image is (unlike a GPT partition table) clearly untrusted territory, if you follow what I am babbling.
>> My images don't have GPT partition tables, they are just raw squashfs file systems. So I'd prefer a way to identify the version either by contents of the image (/@auto/ directory), or something external, like name of the image (/path/to/image/foo.version-X.Y). Either option would be easy to implement when generating the image or directory.
> Hmm, so thinking about this again, I think we might get away with a check "/@auto/ exists and /usr/ does not". I.e. the second part of the check removes any ambiguity: since we unified the OS in /usr it's an excellent way to check if something is or could be an OS tree.
> That said: naked squashfs sucks. Always wrap your squashfs in a GPT wrapper to make things self-descriptive.

It would be an extra step in the image making process. I think naming the images '*.squashfs' documents them well enough for me.

>> But if you have several RootDirectories or RootImages available for a service, what would be the way to tell which ones should be tried if there's no GPT? They can't all have the same name. I think using a specifier (like %q) would solve this issue nicely and there wouldn't be a need for /@auto/ in that case.
> A specifier is resolved at unit file load time only. It wouldn't be the right fit here, since we don't want to require that the paths specified in RootDirectory=/RootImage= are already accessible at the time PID 1 reads/parses the unit file.

Well, that's out then.

> What about this: we could entirely independently of the proposal originally discussed here teach RootDirectory= + RootImage= one magic trick: if the path specified ends in ".auto.d/" (or so) then we'll not actually use the dir/image as-is but assume the path refers to a directory, and we'd pick the newest entry inside it as decided by strverscmp(). Or in other words, we'd establish the general rule that dirs ending in ".auto.d/" contain versioned resources inside, that we could apply here and everywhere else where it fits, too. Of course introducing this rule would be kind of a compat breakage because if anyone happened to have named their dirs like that already we'd suddenly do weird stuff with it the user might not expect. But I think I could live with that. A patch for that should be pretty easy to do, and be very generically useful. I kinda like it. What do you think?

Yes, that could work. I'd need to rename the LVM VG to 'levy.auto.d' (maybe instead create a new VG just for these images) and the directory too but that's no problem. -Topi
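The ".auto.d/" rule proposed above hinges on strverscmp() ordering to pick the newest entry. As an illustration only (glibc's strverscmp() isn't exposed to Python, so this roughly approximates it by comparing digit runs numerically; the helper names are made up and the approximation assumes entries share a consistent naming pattern):

```python
import os
import re

def version_key(name):
    # Rough strverscmp() stand-in: split into digit and non-digit runs
    # so that "img-0.10" sorts after "img-0.9".
    return [int(p) if p.isdigit() else p for p in re.split(r'(\d+)', name)]

def pick_newest(entries):
    # Return the entry that version ordering ranks highest.
    return max(entries, key=version_key)

def resolve_auto_d(path):
    # Model of the proposed rule: if the configured path ends in
    # ".auto.d", use the newest entry inside it; otherwise use it as-is.
    if not path.rstrip('/').endswith('.auto.d'):
        return path
    return os.path.join(path, pick_newest(os.listdir(path)))
```

Note that real strverscmp() has extra rules (e.g. for leading zeros) that this sketch ignores.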
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Do, 11.11.21 18:27, Lennart Poettering (mzerq...@0pointer.de) wrote: > A patch for that should be pretty easy to do, and be very generically > useful. I kinda like it. What do you think? For now I added TODO list items for these ideas: https://github.com/systemd/systemd/commit/af11e0ef843c19cbf8ccaefb93a44dbe4602f7a8#diff-337e547a950fc8a98592f10d964c1e79a304961790a8da0ce449a1f000cefabb Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote: > > Doing this RootDirectory= would make a ton of sense too I guess, but > > it's not as obvious there: we'd need to extend the setting a bit I > > think to explicitly enable this logic. As opposed to the RootImage= > > case (where the logic should be default on) I think any such logic for > > RootDirectory= should be opt-in for security reasons because we cannot > > safely detect environments where this logic is desirable and discern > > them from those where it isn't. In RootImage= we can bind this to the > > right GPT partition type being used to mark root file systems that are > > arranged for this kind of setup. But in RootDirectory= we have no > > concept like that and the stuff inside the image is (unlike a GPT > > partition table) clearly untrusted territory, if you follow what I am > > babbling. > > My images don't have GPT partition tables, they are just raw squashfs file > systems. So I'd prefer a way to identify the version either by contents of > the image (/@auto/ directory), or something external, like name of the image > (/path/to/image/foo.version-X.Y). Either option would be easy to implement > when generating the image or directory. Hmm, so thinking about this again, I think we might get away with a check "/@auto/ exists and /usr/ does not". i.e. the second part of the check removes any ambiguity: since we unified the OS in /usr it's an excellent way to check if something is or could be an OS tree. That said: naked squashfs sucks. Always wrap your squashfs in a GPT wrapper to make things self-descriptive. > But if you have several RootDirectories or RootImages available for a > service, what would be the way to tell which ones should be tried if there's > no GPT? They can't all have the same name. I think using a specifier (like > %q) would solve this issue nicely and there wouldn't be a need for /@auto/ > in that case. A specifier is resolved at unit file load time only. 
It wouldn't be the right fit here, since we don't want to require that the paths specified in RootDirectory=/RootImage= are already accessible at the time PID 1 reads/parses the unit file. What about this: we could entirely independently of the proposal originally discussed here teach RootDirectory= + RootImage= one magic trick: if the path specified ends in ".auto.d/" (or so) then we'll not actually use the dir/image as-is but assume the path refers to a directory, and we'd pick the newest entry inside it as decided by strverscmp(). Or in other words, we'd establish the general rule that dirs ending in ".auto.d/" contain versioned resources inside, that we could apply here and everywhere else where it fits, too. Of course introducing this rule would be kind of a compat breakage because if anyone happened to have named their dirs like that already we'd suddenly do weird stuff with it the user might not expect. But I think I could live with that. A patch for that should be pretty easy to do, and be very generically useful. I kinda like it. What do you think? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On 9.11.2021 23.03, Lennart Poettering wrote:
> On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote:
>>> i.e. we'd drop the counting suffix.
>> Could we have this automatic versioning scheme extended also to service RootImages & RootDirectories as well? If the automatic versioning was also extended to services, we could have A/B testing also for RootImages with automatic fallback to last known good working version.
> At least in the case of RootImage= this was my implied assumption: we'd implement the same there, since that uses the exact same code as systemd-nspawn's image dissection and we definitely want it there. Doing this RootDirectory= would make a ton of sense too I guess, but it's not as obvious there: we'd need to extend the setting a bit I think to explicitly enable this logic. As opposed to the RootImage= case (where the logic should be default on) I think any such logic for RootDirectory= should be opt-in for security reasons because we cannot safely detect environments where this logic is desirable and discern them from those where it isn't. In RootImage= we can bind this to the right GPT partition type being used to mark root file systems that are arranged for this kind of setup. But in RootDirectory= we have no concept like that and the stuff inside the image is (unlike a GPT partition table) clearly untrusted territory, if you follow what I am babbling.

My images don't have GPT partition tables, they are just raw squashfs file systems. So I'd prefer a way to identify the version either by contents of the image (/@auto/ directory), or something external, like name of the image (/path/to/image/foo.version-X.Y). Either option would be easy to implement when generating the image or directory.

But if you have several RootDirectories or RootImages available for a service, what would be the way to tell which ones should be tried if there's no GPT? They can't all have the same name. I think using a specifier (like %q) would solve this issue nicely and there wouldn't be a need for /@auto/ in that case.

> Or in other words: to enable this for RootDirectory= we probably need a new option RootDirectoryVersioned= or so that takes a boolean.

Wouldn't this be unnecessary, if the version magic would be available explicitly as a specifier in the path of RootDirectory= or RootImage=? Then we know that the configuring user made this decision. -Topi
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote: > > i.e. we'd drop the counting suffix. > > Could we have this automatic versioning scheme extended also to service > RootImages & RootDirectories as well? If the automatic versioning was also > extended to services, we could have A/B testing also for RootImages with > automatic fallback to last known good working version. At least in the case of RootImage= this was my implied assumption: we'd implement the same there, since that uses the exact same code as systemd-nspawn's image dissection and we definitely want it there. Doing this RootDirectory= would make a ton of sense too I guess, but it's not as obvious there: we'd need to extend the setting a bit I think to explicitly enable this logic. As opposed to the RootImage= case (where the logic should be default on) I think any such logic for RootDirectory= should be opt-in for security reasons because we cannot safely detect environments where this logic is desirable and discern them from those where it isn't. In RootImage= we can bind this to the right GPT partition type being used to mark root file systems that are arranged for this kind of setup. But in RootDirectory= we have no concept like that and the stuff inside the image is (unlike a GPT partition table) clearly untrusted territory, if you follow what I am babbling. Or in other words: to enable this for RootDirectory= we probably need a new option RootDirectoryVersioned= or so that takes a boolean. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Di, 09.11.21 14:48, Ludwig Nussel (ludwig.nus...@suse.de) wrote: > > and so on. Until boot succeeds in which case we'd rename it: > > > >/@auto/root-x86-64:fedora_36.0 > > > > i.e. we'd drop the counting suffix. > > Thanks for the explanation and pointer! > > Need to think aloud a bit :-) > > That method basically works for systems with read-only root. Ie where > the next OS to boot is in a separate snapshot, eg MicroOS. > A traditional system with rw / on btrfs would stay on the same subvolume > though. Ie the "root-x86-64:fedora_36.0" volume in the example. In > openSUSE package installation automatically leads to ro snapshot > creation. In order to fit in I suppose those could then be named eg. > "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the > subvolume would never be booted. > > Anyway, let's assume the ro case and both efi partition and btrfs volume > use this scheme. That means each time some packages are updated we get a > new subvolume. After reboot the initrd in the efi partition would try to > boot that new subvolume. If it reaches systemd-bless-boot.service the > new subvolume becomes the default for the future. > > So far so good. What if I discover later that something went wrong > though? Some convenience tooling to mark the current version bad again > would be needed. In the sd-boot/kernel case any time you like you can rename an entry to "…+0" to mark it as "bad", you could drop the suffix to mark it as "good" or you could mark it as "+3" to mark it as "dont-know/try-again". Now, at least in theory we could declare the same for this new directory auto-discovery scheme. But I am not entirely sure this will work out trivially IRL because I have the suspicion one cannot rename subvolumes which are the source of a bind mount (i.e. once you boot into one root subtree, then it might be impossible to rename that top-level inode without rebooting first). Would be something to try out. 
If it doesn't work it might suffice to move things one level down, i.e. that the dir that actually becomes root is /@auto/root-x86-64:fedora_36.0/payload/ or so, instead of just /@auto/root-x86-64:fedora_36.0/. I think that would work, and might be desirable anyway so that the enumeration of entries doesn't already leak fs attributes/ownership/access modes/… of the actual root fs. > But then having Tumbleweed in mind it needs some capability to boot any > old snapshot anyway. I guess the solution here would be to just always > generate a bootloader entry, independent of whether a kernel was > included in an update. Each entry would then have to specify kernel, > initrd and the root subvolume to use. > This approach would work with a separate usr volume also. In that case > kernel, initrd, root and usr volume need to be linked by means of a > bootloader entry. For the GPT case if you want to bind a kernel together with a specific root fs, you'd do this by specifying 'root=PARTLABEL=fooos_0.3' on the kernel cmdline. I'd take inspiration from that and maybe introduce 'rootentry=fedora_36.2' or so which would then be honoured by the logic we are discussing here, and would hard override which subdir to use, regardless of versioning preference, assessment counting and so on. (Yeah, the subvol= mount option for btrfs would work too, but as mentioned I'd keep this reasonably independent of btrfs where it's easy, plain dirs otherwise are fine too after all. Which reminds me, recent util-linux implements the X-mount.subdir= mount option, which means one could also use 'rootflags=X-mount.subdir=@auto/fedora_36.2' as a non-btrfs-specific way to express the btrfs-specific 'rootflags=subvol=@auto/fedora_36.2') Lennart -- Lennart Poettering, Berlin
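The rename-based marking discussed above (append "+0" for bad, re-arm with "+3" for dont-know/try-again, strip the suffix for good) can be sketched as pure name transformations. This is only an illustrative model with made-up helper names, not anything systemd ships:

```python
import re

# Matches a trailing "+tries-left[-tries-done]" boot-assessment suffix.
_COUNTERS = re.compile(r'^(?P<stem>.*)\+\d+(?:-\d+)?$')

def strip_counters(entry):
    # Drop the counter suffix, if any.
    m = _COUNTERS.match(entry)
    return m.group('stem') if m else entry

def mark_good(entry):
    # A blessed entry carries no counters at all.
    return strip_counters(entry)

def mark_bad(entry):
    # "+0" means no boot attempts left: the entry is never picked.
    return strip_counters(entry) + '+0'

def mark_indeterminate(entry, tries=3):
    # Re-arm the entry for another round of boot attempts.
    return strip_counters(entry) + f'+{tries}'
```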
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On 8.11.2021 17.32, Lennart Poettering wrote:
> Besides the GPT auto-discovery where versioning is implemented the way I mentioned, there's also the sd-boot boot loader which does roughly the same kind of OS versioning with the boot entries it discovers. So right now, you can already choose whether:
> 1. you want to do OS versioning on the boot loader entry level: name your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the boot loader spec) and similar and the boot loader automatically picks it up, makes sense of it and boots the newest version installed.
> 2. you want to do OS versioning on the GPT partition table level: name your partitions "fooos-0.1" and similar, with the right GPT type, and tools such as systemd-nspawn, systemd-dissect, portable services, RootImage= in service unit files all will be able to automatically pick the newest version of the OS among the ones in the image.
> and now:
> 3. If we implement what I propose above then you could do OS versioning on the file system level too. (Or you could do a combination of the above, if you want — which is highly desirable I think in case you want a universal image that can boot on bare metal and in nspawn in a nice versioned way.)
> Now, in sd-boot's versioning logic we implement an automatic boot assessment logic on top of the OS versioning: if you add a "+x-y" string into the boot entry name we use it as x=tries-left and y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same as fooos-0.1.efi, except that there are 3 attempts left and 0 done yet. On each boot attempt the boot loader decreases x and increases y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi → fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped from the filename, i.e. → fooos-0.1.efi. For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT.
> Now, why am I mentioning all this? Right now this assessment counter logic is only implemented for the OS versioning as implemented by sd-boot. But I think it would make a ton of sense to implement the same scheme for the GPT partition table OS versioning, and then also for the fs-level OS versioning as proposed in this thread. Or to say this explicitly: we could define the spec to say that if we encounter:
> /@auto/root-x86-64:fedora_36.0+3-0
> on first boot attempt we'd rename it:
> /@auto/root-x86-64:fedora_36.0+2-1
> and so on. Until boot succeeds in which case we'd rename it:
> /@auto/root-x86-64:fedora_36.0
> i.e. we'd drop the counting suffix.

Could we have this automatic versioning scheme extended also to service RootImages & RootDirectories as well? If the automatic versioning was also extended to services, we could have A/B testing also for RootImages with automatic fallback to last known good working version.

In my setup, all services use either a RootImage= or RootDirectory= (for early boot services). Most of them don't care about kernel version, so the services use a shared drop-in (LVM logical volume 'levy'):

[Service]
RootImage=/dev/levy/%p-all.squashfs

The device path will then be for example /dev/levy/systemd-networkd-all.squashfs. For udev and systemd-modules, kernel version is used (/usr/local/lib/rootimages/systemd-udevd-5.14.0-2-amd64.dir), so the services use this drop-in:

[Service]
RootDirectory=/usr/local/lib/rootimages/%p-%v.dir

Instead of (or in addition to) /@auto/ paths inside the RootImage= / RootDirectory=, the version could be available as a modifier to part of the device or directory pathname, for example:

[Service]
RootImage=/dev/levy/%p-all-@auto.squashfs

or

[Service]
RootImage=/usr/local/lib/rootimages/%p-%v-@auto.squashfs

Maybe %a instead of @auto. This would then match /dev/levy/systemd-networkd-all-2021-11.09.0.squashfs as the highest version, but if that refuses to start, PID 1 would try to start /dev/levy/systemd-networkd-all-2021-11.08.2.squashfs instead. -Topi
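The "+x-y" counter walk described above (fooos-0.1+3-0 → fooos-0.1+2-1 → … → blessed fooos-0.1) is just a pair of renames. A minimal Python model of the two operations, with hypothetical function names (sd-boot does this natively in the boot loader; this only illustrates the naming convention):

```python
import re

# "+tries-left-tries-done" suffix, e.g. "fooos-0.1+3-0".
_COUNTER = re.compile(r'^(?P<stem>.*)\+(?P<left>\d+)-(?P<done>\d+)$')

def record_attempt(entry):
    # One boot attempt: tries-left goes down, tries-done goes up,
    # e.g. "fooos-0.1+3-0" -> "fooos-0.1+2-1".
    m = _COUNTER.match(entry)
    if m is None or int(m.group('left')) == 0:
        return entry  # no counters, or attempts already exhausted
    return '%s+%d-%d' % (m.group('stem'),
                         int(m.group('left')) - 1,
                         int(m.group('done')) + 1)

def bless(entry):
    # Successful boot: drop the counters entirely,
    # e.g. "fooos-0.1+2-1" -> "fooos-0.1".
    m = _COUNTER.match(entry)
    return m.group('stem') if m else entry
```

The same transformations would apply whether the name is an EFI binary, a GPT partition label, or an /@auto/ directory entry.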
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Lennart Poettering wrote: > On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote: > [...] >> MicroOS has a similar situation. It edits /etc/fstab. > > microoos is a suse thing? Yeah. https://get.opensuse.org/microos/ It uses regular package management but instead of installing rpms in the running system (which is read-only) it does so in a btrfs snapshot. >> Anyway in the above example I guess if you install some updates you'd >> get eg root-x86-64:fedora_37.2, .3, .4 etc? > > [...] > > The GPT auto-discovery thing basically does an strverscmp() on the > full GPT partition label string, i.e. it does not attempt to split a > name from a version, but assumes strverscmp() will handle a common > prefix nicely anyway. I'd do it the exact same way here: if there are > multiple options, then pick the newest as per strverscmp(), but that > also means it's totally fine to not version your stuff and instead of > calling it "root-x86-64:fedora_37.3" could could also just name it > "root-x86-64:fedora" if you like, and then not have any versioning. Nice. Means it might even work with just "root" for systems that get installed the traditional way and no intention to move the hard disk around. >> I suppose the autodetection is meant to boot the one sorted last. What >> if that one turns out to be bad though? How to express rollback in that >> model? > > Besides the GPT auto-discovery where versioning is implemented the way > I mentioned, there's also the sd-boot boot loader which does roughly > the same kind of OS versioning with the boot entries it discovers> [...] > For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT. > [...] > Or to say this explicitly: we could define the spec to say that if > we encounter: > >/@auto/root-x86-64:fedora_36.0+3-0 > > on first boot attempt we'd rename it: > >/@auto/root-x86-64:fedora_36.0+2-1 > > and so on. Until boot succeeds in which case we'd rename it: > >/@auto/root-x86-64:fedora_36.0 > > i.e. 
> we'd drop the counting suffix. Thanks for the explanation and pointer! Need to think aloud a bit :-) That method basically works for systems with read-only root. Ie where the next OS to boot is in a separate snapshot, eg MicroOS. A traditional system with rw / on btrfs would stay on the same subvolume though. Ie the "root-x86-64:fedora_36.0" volume in the example. In openSUSE package installation automatically leads to ro snapshot creation. In order to fit in I suppose those could then be named eg. "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the subvolume would never be booted. Anyway, let's assume the ro case and both efi partition and btrfs volume use this scheme. That means each time some packages are updated we get a new subvolume. After reboot the initrd in the efi partition would try to boot that new subvolume. If it reaches systemd-bless-boot.service the new subvolume becomes the default for the future. So far so good. What if I discover later that something went wrong though? Some convenience tooling to mark the current version bad again would be needed. But then having Tumbleweed in mind it needs some capability to boot any old snapshot anyway. I guess the solution here would be to just always generate a bootloader entry, independent of whether a kernel was included in an update. Each entry would then have to specify kernel, initrd and the root subvolume to use. This approach would work with a separate usr volume also. In that case kernel, initrd, root and usr volume need to be linked by means of a bootloader entry. Means the counter mechanism wouldn't actually be needed on fs or partition level in practice after all. It's sufficient in the bootloader. cu Ludwig -- (o_ Ludwig Nussel //\ V_/_ http://www.suse.com/ SUSE Software Solutions Germany GmbH, GF: Ivo Totev HRB 36809 (AG Nürnberg)
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote: > Lennart Poettering wrote: > > [...] > > 3. Inside the "@auto" dir of the "super-root" fs, have dirs named > > <type>[:<name>]. The type should have a similar vocabulary > > as the GPT spec type UUIDs, but probably use textual identifiers > > rather than UUIDs, simply because naming dirs by uuids is > > weird. Examples: > > > > /@auto/root-x86-64:fedora_36.0/ > > /@auto/root-x86-64:fedora_36.1/ > > /@auto/root-x86-64:fedora_37.1/ > > /@auto/home/ > > /@auto/srv/ > > /@auto/tmp/ > > > > Which would be assembled by the initrd into the following via bind > > mounts: > > > > / → /@auto/root-x86-64:fedora_37.1/ > > /home/ → /@auto/home/ > > /srv/ → /@auto/srv/ > > /var/tmp/ → /@auto/tmp/ > > > > If we do this, then we should also leave the door open so that maybe > > ostree can be hooked up with this, i.e. if we allow the dirs in > > /@auto/ to actually be symlinks, then they could put their ostree > > checkouts wherever they want and then create a symlink > > /@auto/root-x86-64:myostreeos pointing to it, and their image would be > > spec conformant: we'd boot into that automatically, and so would > > nspawn and similar things. Thus they could switch their default OS to > > boot into without patching kernel cmdlines or such, simply by updating > > that symlink, and vanilla systemd would know how to rearrange things. > > MicroOS has a similar situation. It edits /etc/fstab. MicroOS is a SUSE thing? > Anyway in the above example I guess if you install some updates you'd > get eg root-x86-64:fedora_37.2, .3, .4 etc? Well, the spec wouldn't mandate that. But yeah, the idea is that you could do it like that if you want. What's important is to define the vocabulary to make this easy and possible, but of course, whether people follow such an update scheme is up to them.
I mean, it's the same as with the GPT auto discovery logic: it already implements such a versioning scheme because it's easy to implement, but if you don't want to take benefit of the versioning, then don't, it's fine regardless. The logic we'd define here is about *consuming* available OS root filesystems, not about *installing* them, after all. The GPT auto-discovery thing basically does an strverscmp() on the full GPT partition label string, i.e. it does not attempt to split a name from a version, but assumes strverscmp() will handle a common prefix nicely anyway. I'd do it the exact same way here: if there are multiple options, then pick the newest as per strverscmp(), but that also means it's totally fine to not version your stuff and instead of calling it "root-x86-64:fedora_37.3" you could also just name it "root-x86-64:fedora" if you like, and then not have any versioning. > I suppose the autodetection is meant to boot the one sorted last. What > if that one turns out to be bad though? How to express rollback in that > model? Besides the GPT auto-discovery where versioning is implemented the way I mentioned, there's also the sd-boot boot loader which does roughly the same kind of OS versioning with the boot entries it discovers. So right now, you can already choose whether: 1. you want to do OS versioning on the boot loader entry level: name your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the boot loader spec) and similar and the boot loader automatically picks it up, makes sense of it and boots the newest version installed. 2. you want to do OS versioning on the GPT partition table level: name your partitions "fooos-0.1" and similar, with the right GPT type, and tools such as systemd-nspawn, systemd-dissect, portable services, RootImage= in service unit files all will be able to automatically pick the newest version of the OS among the ones in the image. and now: 3.
If we implement what I propose above then you could do OS versioning on the file system level too. (Or you could do a combination of the above, if you want — which is highly desirable I think in case you want a universal image that can boot on bare metal and in nspawn in a nice versioned way.) Now, in sd-boot's versioning logic we implement an automatic boot assessment logic on top of the OS versioning: if you add a "+x-y" string into the boot entry name we use it as x=tries-left and y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same as fooos-0.1.efi, except that there are 3 attempts left and 0 done yet. On each boot attempt the boot loader decreases x and increases y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi → fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped from the filename, i.e. → fooos-0.1.efi. For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT. Now, why am I mentioning all this? Right now this assessment counter logic is only implemented for the OS versioning as implemented by sd-boot. But I think it would make a ton of sense to implement the same scheme for the GPT partition table OS versioning, and then also for the fs-level OS versioning as proposed in this thread.
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Lennart Poettering wrote: > [...] > 3. Inside the "@auto" dir of the "super-root" fs, have dirs named > <type>[:<name>]. The type should have a similar vocabulary > as the GPT spec type UUIDs, but probably use textual identifiers > rather than UUIDs, simply because naming dirs by uuids is > weird. Examples: > > /@auto/root-x86-64:fedora_36.0/ > /@auto/root-x86-64:fedora_36.1/ > /@auto/root-x86-64:fedora_37.1/ > /@auto/home/ > /@auto/srv/ > /@auto/tmp/ > > Which would be assembled by the initrd into the following via bind > mounts: > > / → /@auto/root-x86-64:fedora_37.1/ > /home/ → /@auto/home/ > /srv/ → /@auto/srv/ > /var/tmp/ → /@auto/tmp/ > > If we do this, then we should also leave the door open so that maybe > ostree can be hooked up with this, i.e. if we allow the dirs in > /@auto/ to actually be symlinks, then they could put their ostree > checkouts wherever they want and then create a symlink > /@auto/root-x86-64:myostreeos pointing to it, and their image would be > spec conformant: we'd boot into that automatically, and so would > nspawn and similar things. Thus they could switch their default OS to > boot into without patching kernel cmdlines or such, simply by updating > that symlink, and vanilla systemd would know how to rearrange things. MicroOS has a similar situation. It edits /etc/fstab. Anyway in the above example I guess if you install some updates you'd get eg root-x86-64:fedora_37.2, .3, .4 etc? I suppose the autodetection is meant to boot the one sorted last. What if that one turns out to be bad though? How to express rollback in that model? cu Ludwig -- (o_ Ludwig Nussel //\ V_/_ http://www.suse.com/ SUSE Software Solutions Germany GmbH, GF: Ivo Totev HRB 36809 (AG Nürnberg)
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On Mi, 03.11.21 13:52, Chris Murphy (li...@colorremedies.com) wrote: > There is a Discoverable Partitions Specification > http://systemd.io/DISCOVERABLE_PARTITIONS/ > > The problem with this for Btrfs, ZFS, and LVM is a single volume can > represent multiple use cases via multiple volumes: subvolumes (btrfs), > datasets (ZFS), and logical volumes (LVM). I'll just use the term > sub-volume for all of these, but I'm open to some other generic term. > > None of the above volume managers expose the equivalent of GPT's > partition type GUID per sub-volume. > > One possibility that's available right now is the sub-volume's name. > All we need is a spec for that naming convention. One of the strengths of the GPT arrangement is that we can very naturally use the type system to identify what kind of data something contains, and then use the GPT partition label to say what its name is, and version (and we could encode more if we wanted). We use that to implement a very simple A/B logic in the image dissection logic of systemd-gpt-auto-generator, systemd-nspawn, systemd-dissect and so on: you can have multiple partitions named "foo-0.1", "foo-0.2", "foo-0.3" and so on, all of the same type 8484680c-9521-48c6-9c11-b0720656f69e (the type for /usr/ partitions for x86-64), and then we'll automatically pick the newest version "foo-0.3". Hence, at the baseline any such spec should have similar concepts, and clearly be able to identify both type *and* name/version, otherwise it couldn't match the GPT spec feature-wise. > An early prototype of this idea was posted by Lennart: > https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html Given that the GPT spec is reality and kinda established (in contrast to what the blog story describes) I'd really focus on adding a similar-in-spirit spec that picks up from there, and tries to minimize conceptual differences. Note that I'd distance any such spec from btrfs btw. btrfs subvolumes are in many ways regular directories.
Thus I think the spec should only define how directories are supposed to be assembled, and if those directories are actually subvolumes, great, but the spec can be entirely independent of that, i.e. it should be possible to implement it on ext4 and xfs too.

(I personally think LVM — as an enterprise storage layer — is pretty uninteresting for any automatic handling like this in systemd though. If LVM wants automatic assembly they should do things themselves; I doubt systemd needs to care. Moreover, I have the impression that people who are into LVM and the pain it brings are probably not the type of people who like the automatic handling that systemd-gpt-auto-generator brings. – Yes, you might notice, I am not a fan of LVM.

I don't think ZFS is interesting either, i.e. I wouldn't touch this with a 10m pole, given how unresolved their licensing mess is. But I'd recommend them to just implement the btrfs subvol ioctls, so that they could get the hookup for free. I understand their semantics are similar enough to make this possible.)

I think implementation of a spec like this is not entirely trivial. The thing is that we can't determine what we need to do just by looking at the disk. We'd have to look for a specially marked root fs, and then mount it (which might first involve luks/integrity/… and thus interactivity), and then look into it, and then mount some dirs it includes in a new way. This is a substantially more complex logic — the GPT stuff is much simpler: we just look at the disk, figure things out, and then generate mount units for it. And that's really it.

Anyway, I am not against this; I am mostly just saying that it isn't as easy as it might look to get this working robustly, i.e. the initrd probably would have to do things in multiple phases: first mount the relevant fs to /sysauto/ or so, and then, after looking at this, mount the right subdirs into /sysroot/ (as we usually do) and only then transition into it.
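[Editor's note: the multi-phase initrd flow described above — mount the specially marked fs at a staging point, inspect it, then assemble /sysroot before transitioning — could be sketched as a planner that merely emits the required mount steps instead of executing them. The function name, the path defaults, and the shape of the discovered-directories map are all illustrative assumptions, not anything systemd actually implements.]

```python
def plan_assembly(superroot_dev, discovered_dirs,
                  staging="/sysauto", target="/sysroot"):
    """Return the ordered mount operations for a two-phase assembly:
    phase 1 mounts the super-root fs at a staging point so it can be
    inspected; phase 2 bind-mounts the chosen sub-directories into the
    real root before the initrd transitions into it.

    discovered_dirs maps entry name -> destination below the target
    root ("" meaning the root itself)."""
    steps = [("mount", superroot_dev, staging)]
    # The root entry (shallowest destination) must be mounted first so
    # the deeper binds land inside it.
    for name, dest in sorted(discovered_dirs.items(),
                             key=lambda kv: kv[1].count("/")):
        steps.append(("bind", f"{staging}/{name}", f"{target}{dest}"))
    return steps

# e.g. plan_assembly("/dev/sda2", {"usr": "/usr", "root": ""})
# yields the staging mount, then the root bind, then the /usr bind.
```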
Anyway, I think a spec like I'd do it today, taking all of the above into account, would look a bit like this:

1. Define a new GPT type UUID for these specially arranged "super-root" file systems (a single one for all archs). (I call this "super-root" to make clear that it's not a regular root fs but one that contains potentially multiple in parallel.)

2. Inside this "super-root" fs, have one top-level dir, maybe called "@auto" or something like that. Why do this? Two reasons: so that we can recognize an implementation of the spec both on the block level (via the GPT type id) and on the fs level (via this specially named top-level dir). The latter is interesting for potential MBR compat. And the other reason is that if this is used on ext4 we don't get confused by lost+found. (Also, people could place whatever else they want in the root dir of the fs; for example, ostree could do its thing in some other subdir of the root fs if it wants to.)

3. Inside the "@auto" dir of the "super-root" fs,
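[Editor's note: point 2's filesystem-level half of the double recognition — look for the marker directory, ignore everything else at the top level such as lost+found — could be as small as the following hypothetical helper; the function name and return convention are made up for illustration.]

```python
from pathlib import Path

def find_auto_dir(mountpoint):
    """Return the '@auto' marker directory if the mounted fs follows
    the draft spec, else None. Other top-level entries (lost+found,
    an ostree dir, ...) are deliberately ignored, per the proposal."""
    marker = Path(mountpoint) / "@auto"
    return marker if marker.is_dir() else None
```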
Re: [systemd-devel] the need for a discoverable sub-volumes specification
On 11/3/21 12:52 PM, Chris Murphy wrote:

> There is a Discoverable Partitions Specification
> http://systemd.io/DISCOVERABLE_PARTITIONS/
>
> The problem with this for Btrfs, ZFS, and LVM is a single volume can
> represent multiple use cases via multiple volumes: subvolumes (btrfs),
> datasets (ZFS), and logical volumes (LVM). I'll just use the term
> sub-volume for all of these, but I'm open to some other generic term.
>
> None of the above volume managers expose the equivalent of GPT's
> partition type GUID per sub-volume.

You can't trust that information anyway. At the end of the day, you attempt to mount a block device.

This gets even more complicated as volumes may nest. That is, you could have a logical volume in LVM that is a physical volume in a lower context, which is part of a volume group containing logical volumes. Now... it probably doesn't make sense in most cases to try to take things that far, of course. Perhaps I should have used a better combo of layering, like something with logical volumes and software RAIDing (plus encryption, etc.; lots of dev mapper possibilities).

Let's just say, there's a reason for the explicitness of fstab. Guessing can be done, but at the end of the day, it's going to be a guess. Could be a very bad guess.

> One possibility that's available right now is the sub-volume's name.
> All we need is a spec for that naming convention.
>
> An early prototype of this idea was posted by Lennart:
> https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html
>
> Lennart previously mentioned elsewhere that this is probably outdated.
> So let's update it and bring it more in line with the purpose and goal
> set out in the discoverable partition spec, which is to obviate the
> need for /etc/fstab.

You'll have to move the "explicit intent" data into the "things" you discover. It's not there today, and there are good reasons why it shouldn't be there. You may not like fstab, but it is an abstraction which prevents making assumptions about the underlying block devices.
Not saying you can't make an fstab alternative, but at the end of the day, it's an fstab alternative (you've just moved things from "here" to "there"). Or, you've placed a behavioral assumption onto things that wasn't there before. And I'd be careful about the latter. A lot of my block devices are partitionless, as the good Lord intended things to be.
Re: [systemd-devel] the need for a discoverable sub-volumes specification
Lennart most recently (about a year ago) wrote on this in a mostly unrelated Fedora devel@ thread. I've found the following relevant excerpts and provide the source URLs as well.

> BTW, we once upon a time added a TODO list item of adding a btrfs
> generator to systemd, similar to the existing GPT generator: it would
> look at the subvolumes of the root btrfs fs, and then try to mount
> stuff it finds if it follows a certain naming scheme.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/M756KVDNY65VONU3GA5CSXB4LBJD3ZIW/

> All I am asking for is to make this simple and robust and forward
> looking enough so that we can later add something like the generator I
> proposed without having to rearrange anything. i.e. make the most
> basic stuff self-describing now, even if the automatic
> discovering/mounting of other subvols doesn't happen today, or even
> automatic snapshotting. By doing that correctly now, you can easily
> extend things later incrementally without breaking stuff, just by
> *adding* stuff. And you gain immediate compat with "systemd-nspawn
> --image=" right away as the basic minimum, which already is great.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/JB2PMFPPRS4YII3Q4BMHW3V33DM2MT44/

> We manage to name RPMs with versions, epochs, archs and so on, I doubt
> we need much more for naming subvolumes to auto-assemble.

https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/VBVFQOG5EYI73CGFVCLMGX72IZUCQEYG/

-- Chris Murphy