Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-31 Thread Chris Murphy
On Thu, Dec 30, 2021 at 3:59 PM Chris Murphy  wrote:
>
> ZFS uses volume and user properties which we could probably mimic with
> xattr. I thought I asked about xattr instead of subvolume names at one
> point in the thread but I don't see it. So instead of using subvolume
> names, what about stuffing this information in xattr? My gut instinct
> is that this is less transparent and user friendly, and it requires
> users to know more tools to troubleshoot and fix things, etc.

Separate from whether the obscurity of an xattr is a good idea or not,
read-only snapshots can't have xattr added, removed, or modified. We
can rename read-only snapshots, however.

While we could unset the ro property, that also wipes the received
UUID used by send/receive. And while we could make an rw snapshot of
the ro snapshot, modify the xattr, then make an ro snapshot of the rw
snapshot, this alters the parent UUID, also used by send/receive. So
it complicates send/receive workflows as a potential update mechanism,
or for backup/restore, for anything that tracks these UUIDs, e.g. btrbk.


-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-30 Thread Chris Murphy
(I'm sorta not doing a great job of using "sub-volume" to mean
generically any of Btrfs subvolume or a directory or a logical volume,
so hopefully anyone still following can make the leap that I don't
intend this spec to be Btrfs specific. I like it being general
purpose.)

On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel  wrote:
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

ZFS uses volume and user properties which we could probably mimic with
xattr. I thought I asked about xattr instead of subvolume names at one
point in the thread but I don't see it. So instead of using subvolume
names, what about stuffing this information in xattr? My gut instinct
is that this is less transparent and user friendly, and it requires
users to know more tools to troubleshoot and fix things, etc.


--
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-28 Thread Chris Murphy
On Tue, Dec 21, 2021 at 6:57 AM Ludwig Nussel  wrote:
>
> Chris Murphy wrote:

> > The part I'm having a hard time separating is the implicit case (use
> > some logic to assemble the correct objects), versus explicit (the
> > bootloader snippet points to a root and the root contains an fstab -
> > nothing about assembly is assumed). And should both paradigms exist
> > concurrently in an installed system, and how to deconflict?
>
> Not sure there is a conflict. The discovery logic is well defined after
> all. Also I assume normal operation wouldn't mix the two. Package
> management or whatever installs updates would automatically do the right
> thing suitable for the system at hand.

rootflags=subvol=/subvolid= should override the discoverable
sub-volumes generator.

I don't expect rootflags is normally used in a discoverable
sub-volumes workflow, but if the user were to add it for some reason,
we'd want it to be favored.
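The override rule above can be sketched as a small check; the function name and the simplified cmdline parsing here are illustrative assumptions, not systemd code:

```python
# Hypothetical sketch: if the user passed rootflags=subvol=... or
# rootflags=subvolid=... on the kernel command line, a discoverable
# sub-volumes generator should stand down and honor the explicit choice.

def rootflags_overrides_generator(cmdline: str) -> bool:
    """Return True if rootflags= explicitly pins a subvolume."""
    for word in cmdline.split():
        if word.startswith("rootflags="):
            opts = word[len("rootflags="):].split(",")
            if any(o.startswith(("subvol=", "subvolid=")) for o in opts):
                return True
    return False

print(rootflags_overrides_generator("ro rootflags=subvol=root quiet"))  # True
print(rootflags_overrides_generator("ro rootflags=noatime quiet"))      # False
```

A real generator would read this from /proc/cmdline at boot.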


>
> > Further, (open)SUSE tends to define the root to boot via `btrfs
> > subvolume set-default` which is information in the file system itself,
> > neither in the bootloader snippet nor in the naming convention. It's
> > neat, but also not discoverable. If users are trying to
>
> The way btrfs is used in openSUSE is based on systems from ten years
> ago. A lot has changed since then. Now with the idea to have /usr on a
> separate read-only subvolume the current model doesn't really work very
> well anymore IMO. So I think there's a window of opportunity to change
> the way openSUSE does things :-)

I think the transactional model can accommodate this better anyway,
and it's the direction I'd like to go in with Fedora. Make
updates/upgrades happen out of band (in a container on a snapshot). We
can apply resource control limits so that the upgrade process doesn't
negatively impact the user's higher priority workload. If the update
fails to complete, or fails a set of simple tests, the snapshot is
simply discarded. No harm done to the running system. If it passes
checks, then its name is changed to indicate it's the favored "next
root" following reboot. And we don't have to keep a database to
snapshot, assemble, and discard things; it can all be done by the
naming scheme.

I think the naming scheme should include some sort of "in-progress"
tag, so it's discoverable that such a sub-volume is (a) not active,
(b) in some state of flux that was potentially interrupted, and (c)
not critical to the system. Such a sub-volume should either be
destroyed (failed update) or renamed (update succeeds). If the owning
process were to fail (crash, power failure), the next time it runs to
check for updates it would discover this "in-progress" sub-volume and
remove it (assume it's in a failed state).
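A minimal sketch of that cleanup pass, assuming a purely hypothetical convention where sub-volumes in flux carry an ".inprogress" suffix (the tag name and directory layout are illustrative, not part of any spec):

```python
# On btrfs the rmtree() below would instead be a `btrfs subvolume delete`;
# plain directory removal stands in for it here so the logic is testable.

import pathlib
import shutil

def prune_in_progress(auto_dir: str) -> list[str]:
    """Remove sub-volumes left in the 'in-progress' state by a crashed
    or interrupted updater; return the names that were removed."""
    removed = []
    for entry in pathlib.Path(auto_dir).iterdir():
        if entry.is_dir() and entry.name.endswith(".inprogress"):
            shutil.rmtree(entry)   # stand-in for subvolume delete
            removed.append(entry.name)
    return sorted(removed)
```

The updater itself would rename the sub-volume to drop the tag on success, so anything still carrying it at the next run is by definition a failed update.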


-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-21 Thread Ludwig Nussel
Chris Murphy wrote:
> On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel  wrote:
>> Lennart Poettering wrote:
>>> Or to say this explicitly: we could define the spec to say that if
>>> we encounter:
>>>
>>>/@auto/root-x86-64:fedora_36.0+3-0
>>>
>>> on first boot attempt we'd rename it:
>>>
>>>/@auto/root-x86-64:fedora_36.0+2-1
>>>
>>> and so on. Until boot succeeds in which case we'd rename it:
>>>
>>>/@auto/root-x86-64:fedora_36.0
>>>
>>> i.e. we'd drop the counting suffix.
>>
>> Thanks for the explanation and pointer!
>>
>> Need to think aloud a bit :-)
>>
>> That method basically works for systems with read-only root. Ie where
>> the next OS to boot is in a separate snapshot, eg MicroOS.
>> A traditional system with rw / on btrfs would stay on the same subvolume
>> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
>> openSUSE package installation automatically leads to ro snapshot
>> creation. In order to fit in I suppose those could then be named eg.
>> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
>> subvolume would never be booted.
> 
> Yeah the N+0 subvolumes could be read-only snapshots, their purpose is
> only to be used as an immutable checkpoint from which to produce
> derivatives, read-write subvolumes. But what about the case of being
> in a preboot environment, and have no way (yet) to rename or create a
> new snapshot to boot, and you need to boot one of these read-only
> snapshots? What if the bootloader was smart enough to add the proper
> volatile overlay arrangement anytime an N+0 subvolume is chosen for
> boot? Is that plausible and useful?

The initrd would have to make those arrangements. AFAICT so far
openSUSE systems just boot into such a RO environment without any
preparations. So fully read-only, just enough to run snapper to create a
usable snapshot again.

>> Anyway, let's assume the ro case and both efi partition and btrfs volume
>> use this scheme. That means each time some packages are updated we get a
>> new subvolume. After reboot the initrd in the efi partition would try to
>> boot that new subvolume. If it reaches systemd-bless-boot.service the
>> new subvolume becomes the default for the future.
>>
>> So far so good. What if I discover later that something went wrong
>> though? Some convenience tooling to mark the current version bad again
>> would be needed.
>>
>> But then having Tumbleweed in mind it needs some capability to boot any
>> old snapshot anyway. I guess the solution here would be to just always
>> generate a bootloader entry, independent of whether a kernel was
>> included in an update. Each entry would then have to specify kernel,
>> initrd and the root subvolume to use.
> 
> The part I'm having a hard time separating is the implicit case (use
> some logic to assemble the correct objects), versus explicit (the
> bootloader snippet points to a root and the root contains an fstab -
> nothing about assembly is assumed). And should both paradigms exist
> concurrently in an installed system, and how to deconflict?

Not sure there is a conflict. The discovery logic is well defined after
all. Also I assume normal operation wouldn't mix the two. Package
management or whatever installs updates would automatically do the right
thing suitable for the system at hand.

> Further, (open)SUSE tends to define the root to boot via `btrfs
> subvolume set-default` which is information in the file system itself,
> neither in the bootloader snippet nor in the naming convention. It's
> neat, but also not discoverable. If users are trying to

The way btrfs is used in openSUSE is based on systems from ten years
ago. A lot has changed since then. Now with the idea to have /usr on a
separate read-only subvolume the current model doesn't really work very
well anymore IMO. So I think there's a window of opportunity to change
the way openSUSE does things :-)

cu
Ludwig

-- 
 (o_   Ludwig Nussel
 //\
 V_/_  http://www.suse.com/
SUSE Software Solutions Germany GmbH, GF: Ivo Totev
HRB 36809 (AG Nürnberg)


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-20 Thread Lennart Poettering
On Fr, 10.12.21 12:25, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering
>  wrote:
>
> > That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> > wrapper to make things self-descriptive.
>
> Do you mean the image file contains a GPT, and the squashfs is a
> partition within the image? Does this recommendation apply to any
> image? Let's say it's a Btrfs image. And in the context of this
> thread, the GPT partition type GUID would be the "super-root" GUID?

Yes, I'd always add a GPT wrapper around disk images. It's simple,
extensible and first and foremost self-descriptive: you know what you
are looking at, safely, before parsing the fs. It opens the door for
adding verity data in a very natural way, and more.

Lennart

--
Lennart Poettering, Berlin


[systemd-devel] Antw: Re: Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-13 Thread Ulrich Windl
>>> Chris Murphy wrote on 10.12.2021 at 16:59 in message:
> On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl
>  wrote:
>>
>> >>> Lennart Poettering wrote on 19.11.2021 at 10:17 in message:
>> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
>> >
>> >> How to do swapfiles?
>> >
>> > Is this really a concept that deserves too much attention? I mean, I
>> > have the suspicion that half the benefit of swap space is that it can
>> > act as backing store for hibernation. But swap files are icky for that
>> > since that means the resume code has to mount the fs first, but given
>> > the fs is dirty during the hibernation state this is highly problematic.
>> >
>> > Hence, I have the suspicion that if you do swap you should probably do
>> > swap partitions, not swap files, because it can cover all use cases:
>> > paging *and* hibernation.
>>
>> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?
> 
> I don't think that's supported.
> https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html 
> 

AFAIK Redhat-based Qubes OS uses it; that's where I first saw it.

> 
> -- 
> Chris Murphy






Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-10 Thread Chris Murphy
On Thu, Nov 4, 2021 at 9:39 AM Lennart Poettering
 wrote:

> 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
><type>[:<name>]. The type should have a similar vocabulary
>as the GPT spec type UUIDs, but probably use textual identifiers
>rather than UUIDs, simply because naming dirs by UUIDs is
>weird. Examples:
>
>/@auto/root-x86-64:fedora_36.0/
>/@auto/root-x86-64:fedora_36.1/
>/@auto/root-x86-64:fedora_37.1/
>/@auto/home/
>/@auto/srv/
>/@auto/tmp/
>
>Which would be assembled by the initrd into the following via bind
>mounts:
>
>/ → /@auto/root-x86-64:fedora_37.1/
>/home/→ /@auto/home/
>/srv/ → /@auto/srv/
>/var/tmp/ → /@auto/tmp/

What about arbitrary mountpoints and their subvolumes? Things we can't
predict in advance for all use cases? For example:

For my non-ephemeral systems:

* /var/log is a directory contained in subvolume "varlog-x86-64:fedora.35"
* /var/lib/libvirt/images is a directory contained in subvolume
"varlibvirtimages-x86-64:fedora.35"
* /var/lib/flatpak is a directory contained in a subvolume
"varlibflatpak-x86-64:any" - as it isn't Fedora specific, uses its own
versioning so in this case I'd expect it gets mounted with any
distribution.

These exist so they are excluded from a snapshot and rollback regime
that applies to "root-x86-64:fedora.35" which contains usr/ var/ etc/
A rollback of root does not rollback the systemd journal, VM images,
or flatpaks.

Is space a valid separator in the name of the subvolume? Or
underscore? The separator would become "/" to define the path to the
mount point.
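To make the separator idea concrete, here is a sketch assuming (purely hypothetically) that underscore is the chosen separator and that names follow a "<path-with-underscores>-<arch>:<name>" pattern; none of this is from an actual spec:

```python
# Decode a subvolume name into the mount point it encodes, under the
# assumed convention that "_" in the leading path component maps to "/".

def mount_point_for(subvol: str) -> str:
    """e.g. 'var_lib_flatpak-x86-64:any' -> '/var/lib/flatpak'."""
    path_part = subvol.split("-", 1)[0]     # drop "-<arch>:<name>" suffix
    return "/" + path_part.replace("_", "/")

print(mount_point_for("var_log-x86-64:fedora.35"))    # /var/log
print(mount_point_for("var_lib_flatpak-x86-64:any"))  # /var/lib/flatpak
print(mount_point_for("home"))                        # /home
```

The obvious wrinkle, as the question above implies, is mount points whose real path contains a literal underscore or hyphen, which is why the choice of separator matters.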

Additionally, I'm noticing that none of 'journalctl -o verbose', json,
or export output shows what subvolume was mounted at each mount point.
I need to use systemd debug for this information to be included in the
journal. Assembly of versioned roots is probably useful logging
information to have by default, e.g.

Dec 10 10:45:00 fovo.local systemd[1]: Mounting '@auto/root-x86-64:fedora.35' at /sysroot...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/home' at /home...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibflatpak' at /var/lib/flatpak...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlibvirtimages-x86-64:fedora.35' at /var/lib/libvirt/images...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/varlog-x86-64:fedora.35' at /var/log...
Dec 10 10:45:11 fovo.local systemd[1]: Mounting '@auto/swap' at /var/swap...


-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-10 Thread Chris Murphy
On Thu, Nov 11, 2021 at 12:28 PM Lennart Poettering
 wrote:

> That said: naked squashfs sucks. Always wrap your squashfs in a GPT
> wrapper to make things self-descriptive.

Do you mean the image file contains a GPT, and the squashfs is a
partition within the image? Does this recommendation apply to any
image? Let's say it's a Btrfs image. And in the context of this
thread, the GPT partition type GUID would be the "super-root" GUID?





-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-10 Thread Chris Murphy
On Tue, Nov 9, 2021 at 8:48 AM Ludwig Nussel  wrote:
>
> Lennart Poettering wrote:
> > Or to say this explicitly: we could define the spec to say that if
> > we encounter:
> >
> >/@auto/root-x86-64:fedora_36.0+3-0
> >
> > on first boot attempt we'd rename it:
> >
> >/@auto/root-x86-64:fedora_36.0+2-1
> >
> > and so on. Until boot succeeds in which case we'd rename it:
> >
> >/@auto/root-x86-64:fedora_36.0
> >
> > i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.

Yeah, the N+0 subvolumes could be read-only snapshots; their purpose
is only to be used as an immutable checkpoint from which to produce
derivatives, read-write subvolumes. But what about the case of being
in a preboot environment, having no way (yet) to rename or create a
new snapshot to boot, and needing to boot one of these read-only
snapshots? What if the bootloader was smart enough to add the proper
volatile overlay arrangement anytime an N+0 subvolume is chosen for
boot? Is that plausible and useful?
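The rename scheme quoted above (decrement tries-left on each boot attempt, drop the counter once boot succeeds) can be sketched as two pure name transformations, assuming the "+<tries-left>-<tries-done>" suffix convention from Lennart's example; the function names are illustrative:

```python
import re

# Names look like "<base>+<tries-left>-<tries-done>",
# e.g. "root-x86-64:fedora_36.0+3-0".
_COUNTED = re.compile(r"^(?P<base>.+)\+(?P<left>\d+)-(?P<done>\d+)$")

def next_attempt(name: str) -> str:
    """One boot attempt consumed: 36.0+3-0 -> 36.0+2-1."""
    m = _COUNTED.match(name)
    if not m or int(m["left"]) == 0:
        return name                  # no counter, or out of tries
    return f"{m['base']}+{int(m['left']) - 1}-{int(m['done']) + 1}"

def mark_good(name: str) -> str:
    """Successful boot: drop the counting suffix entirely."""
    m = _COUNTED.match(name)
    return m["base"] if m else name

print(next_attempt("root-x86-64:fedora_36.0+3-0"))  # root-x86-64:fedora_36.0+2-1
print(mark_good("root-x86-64:fedora_36.0+2-1"))     # root-x86-64:fedora_36.0
```

A name left at "+0-N" is out of tries and would never be selected for boot, which is the same property openSUSE's hypothetical "N+0" snapshots exploit.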


> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though? Some convenience tooling to mark the current version bad again
> would be needed.
>
> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.

The part I'm having a hard time separating is the implicit case (use
some logic to assemble the correct objects), versus explicit (the
bootloader snippet points to a root and the root contains an fstab -
nothing about assembly is assumed). And should both paradigms exist
concurrently in an installed system, and how to deconflict?

Further, (open)SUSE tends to define the root to boot via `btrfs
subvolume set-default`, which is information in the file system itself,
neither in the bootloader snippet nor in the naming convention. It's
neat, but also not discoverable. If users are trying to
learn+understand+troubleshoot how systems boot and assemble
themselves, to what degree are they owed transparency without needing
extra tools or decoder rings to reveal settings? The default subvolume
is uniquely btrfs, and without an equivalent anywhere else (so far as
I'm aware) I'm reluctant to use that for day to day boots. I can see
the advantage of this for btrfs for some sort of rescue/emergency boot
subvolume, however: where the boot entry doesn't contain the parameter
"rootflags=subvol=$root" (which acts as an override for the default
subvolume set in the fs itself), the btrfs default subvolume would be
used. I'm struggling with its role in all of this though.


-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-10 Thread Chris Murphy
On Fri, Nov 19, 2021 at 4:17 AM Lennart Poettering
 wrote:
>
> On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
>
> > How to do swapfiles?
>
> Is this really a concept that deserves too much attention?

*shrug* Only insofar as I like order, and like the idea of agreeing on
where things belong if there's going to appear somewhere.

> I mean, I
> have the suspicion that half the benefit of swap space is that it can
> act as backing store for hibernation.

Yes, and that's a terrible conflation. The swapfile/device is for
anonymous pages. And hibernation images are not anon pages, and even
have special rules, like having to be contained in contiguous physical
device blocks. It may turn out that 'swsusp' (Swap Suspend) in the
kernel shouldn't be deprecated in favor of focusing future effort on
'uswsusp': the discussions around signed and authenticated hibernation
images for UEFI Secure Boot and kernel lockdown compatibility have all
been around the kernel implementation.

https://www.kernel.org/doc/Documentation/power/swsusp.rst
https://www.kernel.org/doc/Documentation/power/userland-swsusp.rst


> But swap files are icky for that
> since that means the resume code has to mount the fs first, but given
> the fs is dirty during the hibernation state this is highly problematic.

It's sufficiently complicated and non-fail-safe (it fails dangerously
rather than safely) that it's broken. On btrfs it's more tedious, but
less broken, because you must use both

resume=UUID=$uuid resume_offset=$physicaloffsethibernationimage

In effect the kernel does not need to mount the btrfs file system
read-only at all; it gets the hint for the physical location of the
hibernation image from the kernel boot parameters. Other file systems
support discovery of the physical offset once the file system is
mounted read-only. On Btrfs you can see the swapfile as having a
punch-through mechanism. It's a reservation of blocks, and page-outs
happen directly to that reservation of blocks, not via the file system
itself. This is why there are all these limitations: balance doesn't
touch block groups containing any swapfile blocks, you can't do any
kind of multiple device stuff, you can't snapshot/reflink the
swapfile, etc.

Which is why I'm in favor of just ceding this entire territory over to
systemd to manage correctly. But as a prerequisite, the hibernation
image should be separate from the swapfile. And should have a metadata
format so we can pair file system state to hibernation image state,
that way for sure we aren't running into catastrophic nonsense like
this right at the top of
https://www.kernel.org/doc/Documentation/power/swsusp.rst

   **BIG FAT WARNING**

   If you touch anything on disk between suspend and resume...
...kiss your data goodbye.

   If you do resume from initrd after your filesystems are mounted...
...bye bye root partition.

Horrible.

> Hence, I have the suspicion that if you do swap you should probably do
> swap partitions, not swap files, because it can cover all use cases:
> paging *and* hibernation.

I agree only insofar as it's the most reliable thing we have right
now. Not that it's an efficient or safe design: you can still have
problems if you rw mount a file system and then resume from a
hibernation image. The kernel has no concept of matching a file system
state to that of a hibernation image, so that the hibernation image
can be invalidated, thus avoiding subsequent corruption.

> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
>
> I'd still put it one level down, and mark it with some non-typical
> character so that it is less likely to clash with anything else.

I'm not sure I understand "one level down". The "swap" subvolume would
be in the top-level of the Btrfs file system, just like Fedora's
existing "root" and "home" subvolumes are in the top level.

>
> > b. Is the mount point, /var/swap, OK?
>
> I see no reason why not.

OK super.


>
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> To me it appears these things should be distinct: if automatic
> activation of swap files is desirable, then there should probably be a
> systemd generator that finds all suitable files in /var/swap/ and
> generates .swap units for them. This would then work with any kind of
> setup, i.e. independently of the btrfs auto-discovery stuff. The other
> thing would be the btrfs auto-discovery to then actually mount
> something there automatically.

I think it's desirable only because users 

Re: [systemd-devel] Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-12-10 Thread Chris Murphy
On Mon, Nov 22, 2021 at 3:02 AM Ulrich Windl
 wrote:
>
> >>> Lennart Poettering wrote on 19.11.2021 at 10:17 in message:
> > On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
> >
> >> How to do swapfiles?
> >
> > Is this really a concept that deserves too much attention? I mean, I
> > have the suspicion that half the benefit of swap space is that it can
> > act as backing store for hibernation. But swap files are icky for that
> > since that means the resume code has to mount the fs first, but given
> > the fs is dirty during the hibernation state this is highly problematic.
> >
> > Hence, I have the suspicion that if you do swap you should probably do
> > swap partitions, not swap files, because it can cover all use cases:
> > paging *and* hibernation.
>
> Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

I don't think that's supported.
https://listman.redhat.com/archives/linux-lvm/2020-November/msg00039.html


-- 
Chris Murphy


[systemd-devel] Antw: [EXT] Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-22 Thread Ulrich Windl
>>> Lennart Poettering wrote on 19.11.2021 at 10:17 in message:
> On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:
> 
>> How to do swapfiles?
> 
> Is this really a concept that deserves too much attention? I mean, I
> have the suspicion that half the benefit of swap space is that it can
> act as backing store for hibernation. But swap files are icky for that
> since that means the resume code has to mount the fs first, but given
> the fs is dirty during the hibernation state this is highly problematic.
> 
> Hence, I have the suspicion that if you do swap you should probably do
> swap partitions, not swap files, because it can cover all use cases:
> paging *and* hibernation.

Out of curiosity: What about swap LVs, possibly thin-provisioned ones?

> 
>> Currently I'm creating a "swap" subvolume in the top-level of the file
>> system and /etc/fstab looks like this
>>
>> UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
>> /var/swap/swapfile1 none swap defaults 0 0
>>
>> This seems to work reliably after hundreds of boots.
>>
>> a. Is this naming convention for the subvolume adequate? Seems like it
>> can just be "swap" because the GPT method is just a single partition
>> type GUID that's shared by multiboot Linux setups, i.e. not arch or
>> distro specific
> 
> I'd still put it one level down, and mark it with some non-typical
> character so that it is less likely to clash with anything else.
> 
>> b. Is the mount point, /var/swap, OK?
> 
> I see no reason why not.
> 
>> c. What should the additional naming convention be for the swapfile
>> itself so swapon happens automatically?
> 
> To me it appears these things should be distinct: if automatic
> activation of swap files is desirable, then there should probably be a
> systemd generator that finds all suitable files in /var/swap/ and
> generates .swap units for them. This would then work with any kind of
> setup, i.e. independently of the btrfs auto-discovery stuff. The other
> thing would be the btrfs auto-discovery to then actually mount
> something there automatically.
> 
>> Also, instead of /@auto/ I'm wondering if we could have
>> /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
>> while I'm a big fan of the @ symbol for typographic history reasons,
>> it's being used in the subvolume/snapshot regimes rather haphazardly
>> for different purposes which might be confusing? e.g. Timeshift
>> expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
>> uses @ for its (visible) root subvolume in which everything else goes.
>> And still ZFS uses @ for their (read-only) snapshots.
> 
> I try to keep the "systemd" name out of entirely generic specs, since
> there are some people who have an issue with that. i.e. this way we
> tricked even Devuan to adopt /etc/os-release and the /run/ hierarchy,
> since they probably aren't even aware that these are systemd things.
> 
> Other chars could be used too: /+auto/ sounds OK to me too, or
> /_auto/, or /=auto/ or so.
> 
> Lennart
> 
> --
> Lennart Poettering, Berlin





Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-19 Thread Lennart Poettering
On Do, 18.11.21 15:01, Chris Murphy (li...@colorremedies.com) wrote:

> On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy  wrote:
> >
> > How to do swapfiles?
> >
> > Currently I'm creating a "swap" subvolume in the top-level of the file
> > system and /etc/fstab looks like this
> >
> > UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
> > /var/swap/swapfile1 none swap defaults 0 0
> >
> > This seems to work reliably after hundreds of boots.
> >
> > a. Is this naming convention for the subvolume adequate? Seems like it
> > can just be "swap" because the GPT method is just a single partition
> > type GUID that's shared by multiboot Linux setups, i.e. not arch or
> > distro specific
> > b. Is the mount point, /var/swap, OK?
> > c. What should the additional naming convention be for the swapfile
> > itself so swapon happens automatically?
>
> Actually I'm thinking of something different suddenly... because
> without user ownership of swapfiles, and instead systemd having domain
> over this, it's perhaps more like:
>
> /x-systemd.auto/swap -> /run/systemd/swap

I'd be conservative with mounting disk stuff to /run/. We do this for
removable disks because the mount points are kinda dynamic, hence it
makes sense, but for this case it sounds unnecessary. /var/swap sounds
fine to me, in particular as the /var/ partition actually sounds like
the right place for it if /var/swap/ is not a mount point in itself but
just a plain subdir.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-19 Thread Lennart Poettering
On Do, 18.11.21 14:51, Chris Murphy (li...@colorremedies.com) wrote:

> How to do swapfiles?

Is this really a concept that deserves too much attention? I mean, I
have the suspicion that half the benefit of swap space is that it can
act as backing store for hibernation. But swap files are icky for that
since that means the resume code has to mount the fs first, but given
the fs is dirty during the hibernation state this is highly problematic.

Hence, I have the suspicion that if you do swap you should probably do
swap partitions, not swap files, because it can cover all use cases:
paging *and* hibernation.

> Currently I'm creating a "swap" subvolume in the top-level of the file
> system and /etc/fstab looks like this
>
> UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it
> can just be "swap" because the GPT method is just a single partition
> type GUID that's shared by multiboot Linux setups, i.e. not arch or
> distro specific

I'd still put it one level down, and mark it with some non-typical
character so that it is less likely to clash with anything else.

> b. Is the mount point, /var/swap, OK?

I see no reason why not.

> c. What should the additional naming convention be for the swapfile
> itself so swapon happens automatically?

To me it appears these things should be distinct: if automatic
activation of swap files is desirable, then there should probably be a
systemd generator that finds all suitable files in /var/swap/ and
generates .swap units for them. This would then work with any kind of
setup, i.e. independently of the btrfs auto-discovery stuff. The other
thing would be the btrfs auto-discovery to then actually mount
something there automatically.
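The generator idea above can be sketched roughly as follows; the directory layout, "swapfile*" naming, and the simplified unit-name escaping are all assumptions (a real generator would use systemd-escape and run at boot, before other units are loaded):

```python
import pathlib

# Rough sketch: scan a swap directory for swap files and emit a .swap
# unit for each, the way a systemd generator would. Escaping here is
# simplified; systemd's actual path escaping handles more cases.

def generate_swap_units(swap_dir: str, out_dir: str) -> list[str]:
    written = []
    for f in sorted(pathlib.Path(swap_dir).glob("swapfile*")):
        # systemd derives unit names from paths:
        # /var/swap/swapfile1 -> var-swap-swapfile1.swap
        unit = str(f).lstrip("/").replace("/", "-") + ".swap"
        pathlib.Path(out_dir, unit).write_text(
            "[Unit]\n"
            f"Description=Swap file {f}\n\n"
            "[Swap]\n"
            f"What={f}\n"
        )
        written.append(unit)
    return written
```

The point of the split Lennart describes is that this activation logic works on any filesystem, while the btrfs auto-discovery is only responsible for getting the subvolume mounted in the first place.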

> Also, instead of /@auto/ I'm wondering if we could have
> /x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
> while I'm a big fan of the @ symbol for typographic history reasons,
> it's being used in the subvolume/snapshot regimes rather haphazardly
> for different purposes which might be confusing? e.g. Timeshift
> expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
> uses @ for its (visible) root subvolume in which everything else goes.
> And still ZFS uses @ for their (read-only) snapshots.

I try to keep the "systemd" name out of entirely generic specs, since
there are some people who have an issue with that. I.e. this way we
tricked even Devuan into adopting /etc/os-release and the /run/
hierarchy, since they probably aren't even aware that these are systemd things.

Other chars could be used too: /+auto/ sounds OK to me, or
/_auto/, or /=auto/ or so.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-18 Thread Chris Murphy
On Thu, Nov 18, 2021 at 2:51 PM Chris Murphy  wrote:
>
> How to do swapfiles?
>
> Currently I'm creating a "swap" subvolume in the top-level of the file
> system and /etc/fstab looks like this
>
> UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
> /var/swap/swapfile1 none swap defaults 0 0
>
> This seems to work reliably after hundreds of boots.
>
> a. Is this naming convention for the subvolume adequate? Seems like it
> can just be "swap" because the GPT method is just a single partition
> type GUID that's shared by multiboot Linux setups, i.e. not arch or
> distro specific
> b. Is the mount point, /var/swap, OK?
> c. What should the additional naming convention be for the swapfile
> itself so swapon happens automatically?

Actually I'm thinking of something different suddenly... because
without user ownership of swapfiles, and instead systemd having domain
over this, it's perhaps more like:

/x-systemd.auto/swap -> /run/systemd/swap

And then systemd just manages the files in that directory per policy,
e.g. do on demand creation of swapfiles with variable size increments,
as well as cleanup.
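As an aside, any such tooling would have to encode the special steps btrfs needs for swapfiles (NOCOW, no holes). A rough Python sketch of that sequence — the helper names here are mine, not anything systemd ships:

```python
import os
import subprocess

def swapfile_commands(path, size_mib):
    # Steps required for a swapfile on btrfs: create the file empty,
    # mark it NOCOW with chattr +C *before* it contains any data, then
    # allocate it without holes -- btrfs rejects swapfiles that are
    # copy-on-write or sparse.
    return [
        ["truncate", "-s", "0", path],
        ["chattr", "+C", path],
        ["fallocate", "-l", f"{size_mib}M", path],
        ["mkswap", path],
    ]

def create_swapfile(path, size_mib):
    cmds = swapfile_commands(path, size_mib)
    for cmd in cmds[:-1]:
        subprocess.run(cmd, check=True)
    os.chmod(path, 0o600)  # restrict permissions before mkswap runs
    subprocess.run(cmds[-1], check=True)
```

(Requires root and a btrfs mount to actually run, of course.)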


-- 
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-18 Thread Chris Murphy
How to do swapfiles?

Currently I'm creating a "swap" subvolume in the top-level of the file
system and /etc/fstab looks like this

UUID=$FSUUID  /var/swap   btrfs   noatime,subvol=swap 0 0
/var/swap/swapfile1 none swap defaults 0 0

This seems to work reliably after hundreds of boots.

a. Is this naming convention for the subvolume adequate? Seems like it
can just be "swap" because the GPT method is just a single partition
type GUID that's shared by multiboot Linux setups, i.e. not arch or
distro specific
b. Is the mount point, /var/swap, OK?
c. What should the additional naming convention be for the swapfile
itself so swapon happens automatically?


Also, instead of /@auto/ I'm wondering if we could have
/x-systemd.auto/ ? This makes it more clearly systemd's namespace, and
while I'm a big fan of the @ symbol for typographic history reasons,
it's being used in the subvolume/snapshot regimes rather haphazardly
for different purposes which might be confusing? e.g. Timeshift
expects subvolumes it manages to be prefixed with @. Meanwhile SUSE
uses @ for its (visible) root subvolume in which everything else goes.
And still ZFS uses @ for their (read-only) snapshots.

--
Chris Murphy


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-11 Thread Topi Miettinen

On 11.11.2021 19.27, Lennart Poettering wrote:

On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote:


Doing this RootDirectory= would make a ton of sense too I guess, but
it's not as obvious there: we'd need to extend the setting a bit I
think to explicitly enable this logic. As opposed to the RootImage=
case (where the logic should be default on) I think any such logic for
RootDirectory= should be opt-in for security reasons because we cannot
safely detect environments where this logic is desirable and discern
them from those where it isn't. In RootImage= we can bind this to the
right GPT partition type being used to mark root file systems that are
arranged for this kind of setup. But in RootDirectory= we have no
concept like that and the stuff inside the image is (unlike a GPT
partition table) clearly untrusted territory, if you follow what I am
babbling.


My images don't have GPT partition tables, they are just raw squashfs file
systems. So I'd prefer a way to identify the version either by contents of
the image (/@auto/ directory), or something external, like name of the image
(/path/to/image/foo.version-X.Y). Either option would be easy to implement
when generating the image or directory.


Hmm, so thinking about this again, I think we might get away with a
check "/@auto/ exists and /usr/ does not". i.e. the second part of the
check removes any ambiguity: since we unified the OS in /usr it's an
excellent way to check if something is or could be an OS tree.

That said: naked squashfs sucks. Always wrap your squashfs in a GPT
wrapper to make things self-descriptive.


It would be an extra step in the image-making process. I think naming the
images '*.squashfs' documents them well enough for me.



But if you have several RootDirectories or RootImages available for a
service, what would be the way to tell which ones should be tried if there's
no GPT? They can't all have the same name. I think using a specifier (like
%q) would solve this issue nicely and there wouldn't be a need for /@auto/
in that case.


A specifier is resolved at unit file load time only. It wouldn't be the
right fit here, since we don't want to require that the paths
specified in RootDirectory=/RootImage= are already accessible at the
time PID 1 reads/parses the unit file.


Well, that's out then.


What about this: we could entirely independently of the proposal
originally discussed here teach RootDirectory= + RootImage= one magic
trick: if the path specified ends in ".auto.d/" (or so) then we'll not
actually use the dir/image as-is but assume the path refers to a
directory, and we'd pick the newest entry inside it as decided by
strverscmp().

Or in other words, we'd establish the general rule that dirs ending in
".auto.d/" contains versioned resources inside, that we could apply
here and everywhere else where it fits, too.

Of course, introducing this rule would be kind of a compat breakage
because if anyone happened to have named their dirs like that already
we'd suddenly do weird stuff with it that the user might not expect. But I
think I could live with that.

A patch for that should be pretty easy to do, and be very generically
useful. I kinda like it. What do you think?


Yes, that could work. I'd need to rename the LVM VG to 'levy.auto.d' 
(maybe instead create a new VG just for these images) and the directory 
too but that's no problem.


-Topi


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-11 Thread Lennart Poettering
On Do, 11.11.21 18:27, Lennart Poettering (mzerq...@0pointer.de) wrote:

> A patch for that should be pretty easy to do, and be very generically
> useful. I kinda like it. What do you think?

For now I added TODO list items for these ideas:

https://github.com/systemd/systemd/commit/af11e0ef843c19cbf8ccaefb93a44dbe4602f7a8#diff-337e547a950fc8a98592f10d964c1e79a304961790a8da0ce449a1f000cefabb

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-11 Thread Lennart Poettering
On Mi, 10.11.21 10:34, Topi Miettinen (toiwo...@gmail.com) wrote:

> > Doing this RootDirectory= would make a ton of sense too I guess, but
> > it's not as obvious there: we'd need to extend the setting a bit I
> > think to explicitly enable this logic. As opposed to the RootImage=
> > case (where the logic should be default on) I think any such logic for
> > RootDirectory= should be opt-in for security reasons because we cannot
> > safely detect environments where this logic is desirable and discern
> > them from those where it isn't. In RootImage= we can bind this to the
> > right GPT partition type being used to mark root file systems that are
> > arranged for this kind of setup. But in RootDirectory= we have no
> > concept like that and the stuff inside the image is (unlike a GPT
> > partition table) clearly untrusted territory, if you follow what I am
> > babbling.
>
> My images don't have GPT partition tables, they are just raw squashfs file
> systems. So I'd prefer a way to identify the version either by contents of
> the image (/@auto/ directory), or something external, like name of the image
> (/path/to/image/foo.version-X.Y). Either option would be easy to implement
> when generating the image or directory.

Hmm, so thinking about this again, I think we might get away with a
check "/@auto/ exists and /usr/ does not". i.e. the second part of the
check removes any ambiguity: since we unified the OS in /usr it's an
excellent way to check if something is or could be an OS tree.

That said: naked squashfs sucks. Always wrap your squashfs in a GPT
wrapper to make things self-descriptive.

> But if you have several RootDirectories or RootImages available for a
> service, what would be the way to tell which ones should be tried if there's
> no GPT? They can't all have the same name. I think using a specifier (like
> %q) would solve this issue nicely and there wouldn't be a need for /@auto/
> in that case.

A specifier is resolved at unit file load time only. It wouldn't be the
right fit here, since we don't want to require that the paths
specified in RootDirectory=/RootImage= are already accessible at the
time PID 1 reads/parses the unit file.

What about this: we could entirely independently of the proposal
originally discussed here teach RootDirectory= + RootImage= one magic
trick: if the path specified ends in ".auto.d/" (or so) then we'll not
actually use the dir/image as-is but assume the path refers to a
directory, and we'd pick the newest entry inside it as decided by
strverscmp().

Or in other words, we'd establish the general rule that dirs ending in
".auto.d/" contains versioned resources inside, that we could apply
here and everywhere else where it fits, too.

Of course, introducing this rule would be kind of a compat breakage
because if anyone happened to have named their dirs like that already
we'd suddenly do weird stuff with it that the user might not expect. But I
think I could live with that.

A patch for that should be pretty easy to do, and be very generically
useful. I kinda like it. What do you think?
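To make the selection rule concrete, here's a small Python model of "pick the newest entry" (using a simplified stand-in for glibc's strverscmp(): digit runs compare numerically, everything else lexically — an approximation, not the exact glibc algorithm):

```python
import re

def version_key(name):
    # Split "fedora_36.10" into ['fedora_', 36, '.', 10, ''] so that
    # numeric runs compare as numbers, roughly mimicking strverscmp().
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

def pick_newest(entries):
    # Given the entries of a "*.auto.d/" directory, return the one that
    # sorts highest, i.e. the version that would be used.
    return max(entries, key=version_key)
```

Note that plain lexical sorting would rank 36.9 above 36.10, which is exactly why strverscmp() semantics matter here.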

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-10 Thread Topi Miettinen

On 9.11.2021 23.03, Lennart Poettering wrote:

On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote:


i.e. we'd drop the counting suffix.


Could we have this automatic versioning scheme extended also to service
RootImages & RootDirectories as well? If the automatic versioning was also
extended to services, we could have A/B testing also for RootImages with
automatic fallback to last known good working version.


At least in the case of RootImage= this was my implied assumption:
we'd implement the same there, since that uses the exact same code as
systemd-nspawn's image dissection and we definitely want it there.

Doing this RootDirectory= would make a ton of sense too I guess, but
it's not as obvious there: we'd need to extend the setting a bit I
think to explicitly enable this logic. As opposed to the RootImage=
case (where the logic should be default on) I think any such logic for
RootDirectory= should be opt-in for security reasons because we cannot
safely detect environments where this logic is desirable and discern
them from those where it isn't. In RootImage= we can bind this to the
right GPT partition type being used to mark root file systems that are
arranged for this kind of setup. But in RootDirectory= we have no
concept like that and the stuff inside the image is (unlike a GPT
partition table) clearly untrusted territory, if you follow what I am
babbling.


My images don't have GPT partition tables, they are just raw squashfs 
file systems. So I'd prefer a way to identify the version either by 
contents of the image (/@auto/ directory), or something external, like 
name of the image (/path/to/image/foo.version-X.Y). Either option would 
be easy to implement when generating the image or directory.


But if you have several RootDirectories or RootImages available for a 
service, what would be the way to tell which ones should be tried if 
there's no GPT? They can't all have the same name. I think using a 
specifier (like %q) would solve this issue nicely and there wouldn't be 
a need for /@auto/ in that case.



Or in other words: to enable this for RootDirectory= we probably need
a new option RootDirectoryVersioned= or so that takes a boolean.


Wouldn't this be unnecessary, if the version magic were available
explicitly as a specifier in the path of RootDirectory= or RootImage=?
Then we know that the configuring user made this decision.


-Topi


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Lennart Poettering
On Di, 09.11.21 19:48, Topi Miettinen (toiwo...@gmail.com) wrote:

> > i.e. we'd drop the counting suffix.
>
> Could we have this automatic versioning scheme extended also to service
> RootImages & RootDirectories as well? If the automatic versioning was also
> extended to services, we could have A/B testing also for RootImages with
> automatic fallback to last known good working version.

At least in the case of RootImage= this was my implied assumption:
we'd implement the same there, since that uses the exact same code as
systemd-nspawn's image dissection and we definitely want it there.

Doing this RootDirectory= would make a ton of sense too I guess, but
it's not as obvious there: we'd need to extend the setting a bit I
think to explicitly enable this logic. As opposed to the RootImage=
case (where the logic should be default on) I think any such logic for
RootDirectory= should be opt-in for security reasons because we cannot
safely detect environments where this logic is desirable and discern
them from those where it isn't. In RootImage= we can bind this to the
right GPT partition type being used to mark root file systems that are
arranged for this kind of setup. But in RootDirectory= we have no
concept like that and the stuff inside the image is (unlike a GPT
partition table) clearly untrusted territory, if you follow what I am
babbling.

Or in other words: to enable this for RootDirectory= we probably need
a new option RootDirectoryVersioned= or so that takes a boolean.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Lennart Poettering
On Di, 09.11.21 14:48, Ludwig Nussel (ludwig.nus...@suse.de) wrote:

> > and so on. Until boot succeeds in which case we'd rename it:
> >
> >    /@auto/root-x86-64:fedora_36.0
> >
> > i.e. we'd drop the counting suffix.
>
> Thanks for the explanation and pointer!
>
> Need to think aloud a bit :-)
>
> That method basically works for systems with read-only root. Ie where
> the next OS to boot is in a separate snapshot, eg MicroOS.
> A traditional system with rw / on btrfs would stay on the same subvolume
> though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
> openSUSE package installation automatically leads to ro snapshot
> creation. In order to fit in I suppose those could then be named eg.
> "root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
> subvolume would never be booted.
>
> Anyway, let's assume the ro case and both efi partition and btrfs volume
> use this scheme. That means each time some packages are updated we get a
> new subvolume. After reboot the initrd in the efi partition would try to
> boot that new subvolume. If it reaches systemd-bless-boot.service the
> new subvolume becomes the default for the future.
>
> So far so good. What if I discover later that something went wrong
> though? Some convenience tooling to mark the current version bad again
> would be needed.

In the sd-boot/kernel case any time you like you can rename an entry
to "…+0" to mark it as "bad", you could drop the suffix to mark it as
"good" or you could mark it as "+3" to mark it as
"dont-know/try-again".

Now, at least in theory we could declare the same for this new
directory auto-discovery scheme. But I am not entirely sure this will
work out trivially IRL because I have the suspicion one cannot rename
subvolumes which are the source of a bind mount (i.e. once you boot
into one root subtree, then it might be impossible to rename that
top-level inode without rebooting first). Would be something to try
out. If it doesn't work it might suffice to move things one level
down, i.e. that the dir that actually becomes root is
/@auto/root-x86-64:fedora_36.0/payload/ or so, instead of just
/@auto/root-x86-64:fedora_36.0/. I think that that would work, and
might be desirable anyway so that the enumeration of entries doesn't
already leak fs attributes/ownership/access modes/…  of actual root
fs.

> But then having Tumbleweed in mind it needs some capability to boot any
> old snapshot anyway. I guess the solution here would be to just always
> generate a bootloader entry, independent of whether a kernel was
> included in an update. Each entry would then have to specify kernel,
> initrd and the root subvolume to use.
> This approach would work with a separate usr volume also. In that case
> kernel, initrd, root and usr volume need to be linked by means of a
> bootloader entry.

For the GPT case if you want to bind a kernel together with a specific
root fs, you'd do this by specifying 'root=PARTLABEL=fooos_0.3' on the
kernel cmdline. I'd take inspiration from that and maybe introduce
'rootentry=fedora_36.2' or so which would then be honoured by the logic
we are discussing here, and would hard override which subdir to use,
regardless of versioning preference, assessment counting and so on.

(Yeah, the subvol= mount option for btrfs would work too, but as
mentioned I'd keep this reasonably independent of btrfs where its
easy, plain dirs otherwise are fine too after all. Which reminds me,
recent util-linux implements the X-mount.subdir= mount option, which
means one could also use 'rootflags=X-mount.subdir=@auto/fedora_36.2'
as non-btrfs-specific way to express the btrfs-specific
'rootflags=subvol=@auto/fedora_36.2')

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Topi Miettinen

On 8.11.2021 17.32, Lennart Poettering wrote:

Besides the GPT auto-discovery where versioning is implemented the way
I mentioned, there's also the sd-boot boot loader which does roughly
the same kind of OS versioning with the boot entries it discovers. So
right now, you can already chose whether:

1. you want to do OS versioning on the boot loader entry level: name
your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the
boot loader spec) and similar and the boot loader automatically
picks it up, makes sense of it and boots the newest version
installed.

2. you want to do OS versioning on the GPT partition table level: name
your partitions "fooos-0.1" and similar, with the right GPT type,
and tools such as systemd-nspawn, systemd-dissect, portable
services, RootImage= in service unit files all will be able to
automatically pick the newest version of the OS among the ones in
the image.

and now:

3. If we implement what I propose above then you could do OS versioning
on the file system level too.

(Or you could do a combination of the above, if you want — which is
highly desirable I think in case you want a universal image that can
boot on bare metal and in nspawn in a nice versioned way.)

Now, in sd-boot's versioning logic we implement an automatic boot
assessment logic on top of the OS versioning: if you add a "+x-y"
string into the boot entry name we use it as x=tries-left and
y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same
as fooos-0.1.efi, except that there are 3 attempts left and 0 done
yet. On each boot attempt the boot loader decreases x and increases
y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi →
fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped
from the filename, i.e. → fooos-0.1.efi.

For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT.

Now, why am I mentioning all this? Right now this assessment counter
logic is only implemented for the OS versioning as implemented by
sd-boot. But I think it would make a ton of sense to implement the
same scheme for the GPT partition table OS versioning, and then also
for the fs-level OS versioning as proposed in this thread.

Or to say this explicitly: we could define the spec to say that if
we encounter:

/@auto/root-x86-64:fedora_36.0+3-0

on first boot attempt we'd rename it:

/@auto/root-x86-64:fedora_36.0+2-1

and so on. Until boot succeeds in which case we'd rename it:

/@auto/root-x86-64:fedora_36.0

i.e. we'd drop the counting suffix.


Could we have this automatic versioning scheme extended also to service 
RootImages & RootDirectories as well? If the automatic versioning was 
also extended to services, we could have A/B testing also for RootImages 
with automatic fallback to last known good working version.


In my setup, all services use either a RootImage= or RootDirectory= (for 
early boot services). Most of them don't care about kernel version, so 
the services use a shared drop-in (LVM logical volume 'levy'):


[Service]
RootImage=/dev/levy/%p-all.squashfs

The device path will then be for example 
/dev/levy/systemd-networkd-all.squashfs.


For udev and systemd-modules, kernel version is used 
(/usr/local/lib/rootimages/systemd-udevd-5.14.0-2-amd64.dir), so the 
services use this drop-in:


[Service]
RootDirectory=/usr/local/lib/rootimages/%p-%v.dir

Instead of (or in addition to) /@auto/ paths inside the RootImage= /
RootDirectory=, the version could be available as a modifier in part of
the device or directory pathname, for example:


[Service]
RootImage=/dev/levy/%p-all-@auto.squashfs

or

[Service]
RootImage=/usr/local/lib/rootimages/%p-%v-@auto.squashfs

Maybe %a instead of @auto.

This would then match 
/dev/levy/systemd-networkd-all-2021-11.09.0.squashfs as the highest 
version, but if that refuses to start, PID1 would try to start 
/dev/levy/systemd-networkd-all-2021-11.08.2.squashfs instead.


-Topi


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-09 Thread Ludwig Nussel
Lennart Poettering wrote:
> On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote:
> [...]
>> MicroOS has a similar situation. It edits /etc/fstab.
> 
> MicroOS is a SUSE thing?

Yeah. https://get.opensuse.org/microos/
It uses regular package management but instead of installing rpms in the
running system (which is read-only) it does so in a btrfs snapshot.

>> Anyway in the above example I guess if you install some updates you'd
>> get eg root-x86-64:fedora_37.2, .3, .4 etc?
> 
> [...]
> 
> The GPT auto-discovery thing basically does an strverscmp() on the
> full GPT partition label string, i.e. it does not attempt to split a
> name from a version, but assumes strverscmp() will handle a common
> prefix nicely anyway. I'd do it the exact same way here: if there are
> multiple options, then pick the newest as per strverscmp(), but that
> also means it's totally fine to not version your stuff and instead of
calling it "root-x86-64:fedora_37.3" you could also just name it
> "root-x86-64:fedora" if you like, and then not have any versioning.

Nice. Means it might even work with just "root" for systems that get
installed the traditional way with no intention to move the hard disk around.

>> I suppose the autodetection is meant to boot the one sorted last. What
>> if that one turns out to be bad though? How to express rollback in that
>> model?
> 
> Besides the GPT auto-discovery where versioning is implemented the way
> I mentioned, there's also the sd-boot boot loader which does roughly
> the same kind of OS versioning with the boot entries it discovers> [...]
> For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT.
> [...]
> Or to say this explicitly: we could define the spec to say that if
> we encounter:
> 
>    /@auto/root-x86-64:fedora_36.0+3-0
> 
> on first boot attempt we'd rename it:
> 
>    /@auto/root-x86-64:fedora_36.0+2-1
> 
> and so on. Until boot succeeds in which case we'd rename it:
> 
>    /@auto/root-x86-64:fedora_36.0
> 
> i.e. we'd drop the counting suffix.

Thanks for the explanation and pointer!

Need to think aloud a bit :-)

That method basically works for systems with read-only root. Ie where
the next OS to boot is in a separate snapshot, eg MicroOS.
A traditional system with rw / on btrfs would stay on the same subvolume
though. Ie the "root-x86-64:fedora_36.0" volume in the example. In
openSUSE package installation automatically leads to ro snapshot
creation. In order to fit in I suppose those could then be named eg.
"root-x86-64:fedora_36.N+0" with increasing N. Due to the +0 the
subvolume would never be booted.

Anyway, let's assume the ro case and both efi partition and btrfs volume
use this scheme. That means each time some packages are updated we get a
new subvolume. After reboot the initrd in the efi partition would try to
boot that new subvolume. If it reaches systemd-bless-boot.service the
new subvolume becomes the default for the future.

So far so good. What if I discover later that something went wrong
though? Some convenience tooling to mark the current version bad again
would be needed.

But then having Tumbleweed in mind it needs some capability to boot any
old snapshot anyway. I guess the solution here would be to just always
generate a bootloader entry, independent of whether a kernel was
included in an update. Each entry would then have to specify kernel,
initrd and the root subvolume to use.
This approach would work with a separate usr volume also. In that case
kernel, initrd, root and usr volume need to be linked by means of a
bootloader entry.

Means the counter mechanism wouldn't actually be needed on fs or
partition level in practice after all. It's sufficient in the bootloader.

cu
Ludwig

-- 
 (o_   Ludwig Nussel
 //\
 V_/_  http://www.suse.com/
SUSE Software Solutions Germany GmbH, GF: Ivo Totev
HRB 36809 (AG Nürnberg)


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-08 Thread Lennart Poettering
On Mo, 08.11.21 14:24, Ludwig Nussel (ludwig.nus...@suse.de) wrote:

> Lennart Poettering wrote:
> > [...]
> > 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
> >    <type>[:<name>]. The type should have a similar vocabulary
> >    as the GPT spec type UUIDs, but probably use textual identifiers
> >    rather than UUIDs, simply because naming dirs by uuids is
> >weird. Examples:
> >
> >    /@auto/root-x86-64:fedora_36.0/
> >    /@auto/root-x86-64:fedora_36.1/
> >    /@auto/root-x86-64:fedora_37.1/
> >    /@auto/home/
> >    /@auto/srv/
> >    /@auto/tmp/
> >
> >Which would be assembled by the initrd into the following via bind
> >mounts:
> >
> >    /         → /@auto/root-x86-64:fedora_37.1/
> >    /home/    → /@auto/home/
> >    /srv/     → /@auto/srv/
> >    /var/tmp/ → /@auto/tmp/
> >
> > If we do this, then we should also leave the door open so that maybe
> > ostree can be hooked up with this, i.e. if we allow the dirs in
> > /@auto/ to actually be symlinks, then they could put their ostree
> > checkotus wherever they want and then create a symlink
> > /@auto/root-x86-64:myostreeos pointing to it, and their image would be
> > spec conformant: we'd boot into that automatically, and so would
> > nspawn and similar things. Thus they could switch their default OS to
> > boot into without patching kernel cmdlines or such, simply by updating
> > that symlink, and vanilla systemd would know how to rearrange things.
>
> MicroOS has a similar situation. It edits /etc/fstab.

MicroOS is a SUSE thing?

> Anyway in the above example I guess if you install some updates you'd
> get eg root-x86-64:fedora_37.2, .3, .4 etc?

Well, the spec wouldn't mandate that. But yeah, the idea is that you
could do it like that if you want. What's important is to define the
vocabulary to make this easy and possible, but of course, whether
people follow such an update scheme is up to them. I mean, it's the
same as with the GPT auto discovery logic: it already implements such
a versioning scheme because it's easy to implement, but if you don't
want to take benefit of the versioning, then don't, it's fine
regardless. the logic we'd define here is about *consuming* available
OS root filesystems, not about *installing* them, after all.

The GPT auto-discovery thing basically does an strverscmp() on the
full GPT partition label string, i.e. it does not attempt to split a
name from a version, but assumes strverscmp() will handle a common
prefix nicely anyway. I'd do it the exact same way here: if there are
multiple options, then pick the newest as per strverscmp(), but that
also means it's totally fine to not version your stuff and instead of
calling it "root-x86-64:fedora_37.3" you could also just name it
"root-x86-64:fedora" if you like, and then not have any versioning.

> I suppose the autodetection is meant to boot the one sorted last. What
> if that one turns out to be bad though? How to express rollback in that
> model?

Besides the GPT auto-discovery where versioning is implemented the way
I mentioned, there's also the sd-boot boot loader which does roughly
the same kind of OS versioning with the boot entries it discovers. So
right now, you can already chose whether:

1. you want to do OS versioning on the boot loader entry level: name
   your EFI binary fooos-0.1.efi (or fooos-0.1.conf, as defined by the
   boot loader spec) and similar and the boot loader automatically
   picks it up, makes sense of it and boots the newest version
   installed.

2. you want to do OS versioning on the GPT partition table level: name
   your partitions "fooos-0.1" and similar, with the right GPT type,
   and tools such as systemd-nspawn, systemd-dissect, portable
   services, RootImage= in service unit files all will be able to
   automatically pick the newest version of the OS among the ones in
   the image.

and now:

3. If we implement what I propose above then you could do OS versioning
   on the file system level too.

(Or you could do a combination of the above, if you want — which is
highly desirable I think in case you want a universal image that can
boot on bare metal and in nspawn in a nice versioned way.)

Now, in sd-boot's versioning logic we implement an automatic boot
assessment logic on top of the OS versioning: if you add a "+x-y"
string into the boot entry name we use it as x=tries-left and
y=tries-done counters. i.e. fooos-0.1+3-0.efi is semantically the same
as fooos-0.1.efi, except that there are 3 attempts left and 0 done
yet. On each boot attempt the boot loader decreases x and increases
y. i.e. fooos-0.1+3-0.efi → fooos-0.1+2-1.efi → fooos-0.1+1-2.efi →
fooos-0.1+0-3.efi. If a boot succeeds the two counters are dropped
from the filename, i.e. → fooos-0.1.efi.

For details see: https://systemd.io/AUTOMATIC_BOOT_ASSESSMENT.
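The counter arithmetic itself is easy to model; here is an illustrative Python rendition of the "+tries-left[-tries-done]" rename scheme described above (a sketch for the directory-name case, not the actual sd-boot C implementation):

```python
import re

_COUNTER = re.compile(r"^(?P<stem>.+)\+(?P<left>\d+)(?:-(?P<done>\d+))?$")

def record_boot_attempt(name):
    # "stem+3-0" -> "stem+2-1"; names without a counter are untouched,
    # and entries with 0 tries left stay marked bad.
    m = _COUNTER.match(name)
    if not m or int(m["left"]) == 0:
        return name
    left, done = int(m["left"]), int(m["done"] or 0)
    return f"{m['stem']}+{left - 1}-{done + 1}"

def mark_good(name):
    # On a successful boot both counters are dropped from the name.
    m = _COUNTER.match(name)
    return m["stem"] if m else name
```

Applied to the /@auto/ case, the boot loader or initrd would rename the subvolume/directory with record_boot_attempt() on each try, and systemd-bless-boot would apply mark_good() once the boot is known good.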

Now, why am I mentioning all this? Right now this assessment counter
logic is only implemented for the OS versioning as implemented by
sd-boot. But I think it would make a ton of sense to 

Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-08 Thread Ludwig Nussel
Lennart Poettering wrote:
> [...]
> 3. Inside the "@auto" dir of the "super-root" fs, have dirs named
>    <type>[:<name>]. The type should have a similar vocabulary
>    as the GPT spec type UUIDs, but probably use textual identifiers
>    rather than UUIDs, simply because naming dirs by uuids is
>weird. Examples:
> 
>    /@auto/root-x86-64:fedora_36.0/
>    /@auto/root-x86-64:fedora_36.1/
>    /@auto/root-x86-64:fedora_37.1/
>    /@auto/home/
>    /@auto/srv/
>    /@auto/tmp/
> 
>Which would be assembled by the initrd into the following via bind
>mounts:
> 
>    /         → /@auto/root-x86-64:fedora_37.1/
>    /home/    → /@auto/home/
>    /srv/     → /@auto/srv/
>    /var/tmp/ → /@auto/tmp/
> 
> If we do this, then we should also leave the door open so that maybe
> ostree can be hooked up with this, i.e. if we allow the dirs in
> /@auto/ to actually be symlinks, then they could put their ostree
> checkouts wherever they want and then create a symlink
> /@auto/root-x86-64:myostreeos pointing to it, and their image would be
> spec conformant: we'd boot into that automatically, and so would
> nspawn and similar things. Thus they could switch their default OS to
> boot into without patching kernel cmdlines or such, simply by updating
> that symlink, and vanilla systemd would know how to rearrange things.
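The assembly step in the quoted proposal could be sketched as a small planning function. This is purely illustrative of the naming convention; plan_mounts and its naive "newest sorts last" pick are assumptions, not systemd code:

```python
from pathlib import Path

def plan_mounts(auto_dir: Path, arch: str = "x86-64") -> dict:
    """Map mount points to entries found under /@auto, following the
    hypothetical <type>[:<name>_<version>] directory naming above."""
    roots, mounts = [], {}
    for entry in sorted(auto_dir.iterdir()):
        type_, _, version = entry.name.partition(":")
        if type_ == f"root-{arch}":
            roots.append((version, entry))
        elif type_ == "home":
            mounts["/home"] = entry
        elif type_ == "srv":
            mounts["/srv"] = entry
        elif type_ == "tmp":
            mounts["/var/tmp"] = entry
    if roots:
        mounts["/"] = max(roots)[1]  # naive pick: version that sorts last
    return mounts
```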

MicroOS has a similar situation. It edits /etc/fstab.

Anyway, in the above example I guess if you install some updates you'd
get e.g. root-x86-64:fedora_37.2, .3, .4 etc?
I suppose the autodetection is meant to boot the one sorted last. What
if that one turns out to be bad though? How to express rollback in that
model?

cu
Ludwig

-- 
 (o_   Ludwig Nussel
 //\
 V_/_  http://www.suse.com/
SUSE Software Solutions Germany GmbH, GF: Ivo Totev
HRB 36809 (AG Nürnberg)


Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-04 Thread Lennart Poettering
On Mi, 03.11.21 13:52, Chris Murphy (li...@colorremedies.com) wrote:

> There is a Discoverable Partitions Specification
> http://systemd.io/DISCOVERABLE_PARTITIONS/
>
> The problem with this for Btrfs, ZFS, and LVM is a single volume can
> represent multiple use cases via multiple volumes: subvolumes (btrfs),
> datasets (ZFS), and logical volumes (LVM). I'll just use the term
> sub-volume for all of these, but I'm open to some other generic term.
>
> None of the above volume managers expose the equivalent of GPT's
> partition type GUID per sub-volume.
>
> One possibility that's available right now is the sub-volume's name.
> All we need is a spec for that naming convention.

One of the strengths of the GPT arrangement is that we can very
naturally use the type system to identify what kind of data something
contains, and then use the GPT partition label to say what its name
and version are (and we could encode more if we wanted). We use that
to implement a very simple A/B logic in the image dissection logic of
systemd-gpt-auto-generator, systemd-nspawn, systemd-dissect and so on:
you can have multiple partitions named "foo-0.1", "foo-0.2", "foo-0.3"
and so on, all of the same type 8484680c-9521-48c6-9c11-b0720656f69e
(the type for /usr/ partitions for x86-64), and then we'll
automatically pick the newest version "foo-0.3".
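That "pick the newest" step can be sketched with a numeric-aware sort key, roughly in the spirit of systemd's version comparison (a simplified stand-in, not the actual algorithm):

```python
import re

def version_key(label: str):
    """Split a label into text and number chunks so numeric parts
    compare numerically: 'foo-0.10' sorts after 'foo-0.9'."""
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", label)]

def pick_newest(labels):
    """Among same-type partition labels, return the highest version."""
    return max(labels, key=version_key)
```

For example, pick_newest(["foo-0.1", "foo-0.2", "foo-0.3"]) yields "foo-0.3".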

hence, at the baseline any such spec should have similar concepts, and
clearly be able to identify both type *and* name/version, otherwise it
couldn't match the gpt spec feature-wise.

> An early prototype of this idea was posted by Lennart:
> https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Given that the gpt spec is reality and kinda established (in contrast
to what the blog story describes) i'd really focus on adding a
similar-in-spirit spec that picks up from there, and tries to minimize
conceptual differences.

Note that I'd distance any such spec from btrfs btw. btrfs subvolumes
are in many ways regular directories. Thus I think the spec should
only define how directories are supposed to be assembled, and if those
directories are actually subvolumes great, but the spec can be
entirely independent of that, i.e. it should be possible to implement
it on ext4 and xfs too.

(I personally think LVM — as an enterprise storage layer — is pretty
uninteresting for any automatic handling like this in systemd
though. If LVM wants automatic assembly they should do things
themselves, I doubt systemd needs to care. Moreover, I have the
impression that people who are into LVM and the pain it brings are
probably not the type of people who like automatic handling like
systemd-gpt-auto-generator brings it. – Yes, you might notice, I am
not a fan of LVM. I don't think ZFS is interesting either, i.e. I
wouldn't touch this with a 10m pole, given how unresolved their
licensing mess is. But I'd recommend them to just implement the btrfs
subvol ioctls, so that they could get the hookup for free. I
understand their semantics are similar enough to make this possible.)

I think implementation of a spec like this is not entirely
trivial. The thing is that we can't determine what we need to do just
by looking at the disk. We'd have to look for a specially marked root
fs, and then mount it (which might first involve luks/integrity/… and
thus interactivity), and then look into it, and then mount some dirs
it includes in a new way. This is a substantially more complex logic —
the GPT stuff is much simpler: we just look at the disk, figure things
out, and then generate mount units for it. And that's really it.

Anyway, I am not against this, I am mostly just saying that it isn't
as easy as it might look to get this working robustly, i.e. the initrd
probably would have to do things in multiple phases: first mount the
relevant fs to /sysauto/ or so, and then after looking at this mount
the right subdirs into /sysroot/ (as we usually do) and only then
transition into it.

Anyway, I think a spec like I'd do it today, taking all of the above
into account would look a bit like this:

1. define a new gpt type uuid for these specially arranged "super-root" file
   systems (a single one for all archs). (I call this "super-root" to
   make clear that it's not a regular root fs but one that potentially
   contains multiple in parallel.)

2. inside this "super-root" fs, have one top-level dir, maybe called
   "@auto" or something like that. Why do this? Two reasons: so that
   we can recognize an implementation of the spec both on the block
   level (via the gpt type id) and on the fs level (via this specially
   named top-level dir). The latter is interesting for potential MBR
   compat. And the other reason is that if this is used on ext4 we don't
   get confused by lost+found. (Also people could place whatever else
   they want in the root dir of the fs, for example ostree could do
   its thing in some other subdir of the root fs if it wants to.)

3. Inside the "@auto" dir of the "super-root" fs, 

Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-03 Thread Christopher Cox

On 11/3/21 12:52 PM, Chris Murphy wrote:

There is a Discoverable Partitions Specification
http://systemd.io/DISCOVERABLE_PARTITIONS/

The problem with this for Btrfs, ZFS, and LVM is a single volume can
represent multiple use cases via multiple volumes: subvolumes (btrfs),
datasets (ZFS), and logical volumes (LVM). I'll just use the term
sub-volume for all of these, but I'm open to some other generic term.

None of the above volume managers expose the equivalent of GPT's
partition type GUID per sub-volume.


You can't trust that information anyway.  At the end of the day, you attempt to
mount a block device.


This gets even more complicated as volumes may nest.  That is, you could have a
logical volume in LVM that is a physical volume in a lower context which is part
of a volume group containing logical volumes.  Now, it probably doesn't make sense
in most cases to try to take things that far, of course.  Perhaps I should have
used a better combo of layering, like something with logical volumes and
software RAIDing (plus encryption, etc.; lots of dev-mapper possibilities).


Let's just say, there's a reason for the explicitness of fstab.  Guessing can be 
done, but at the end of the day, it's going to be a guess.  Could be a very bad 
guess.





One possibility that's available right now is the sub-volume's name.
All we need is a spec for that naming convention.

An early prototype of this idea was posted by Lennart:
https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Lennart previously mentioned elsewhere that this is probably outdated.
So let's update it and bring it more in line with the purpose and goal
set out in the discoverable partition spec, which is to obviate the
need for /etc/fstab.




You'll have to move the "explicit intent" data into the "things" you discover.
It's not there today and there are good reasons why it shouldn't be there.  You
may not like fstab, but it is an abstraction which prevents making assumptions
about the underlying block devices.


Not saying you can't make an fstab alternative, but at the end of the day, it's an
fstab alternative (you've just moved things from "here" to "there").  Or you've
placed a behavioral assumption onto things that wasn't there before.  And I'd be
careful about the latter.


A lot of my block devices are partitionless, as the good Lord intended things 
to be.



Re: [systemd-devel] the need for a discoverable sub-volumes specification

2021-11-03 Thread Chris Murphy
Lennart most recently (about a year ago) wrote on this in a mostly
unrelated Fedora devel@ thread. I've found the following relevant
excerpts and provide the source URL as well.

BTW, we once upon a time added a TODO list item of adding a btrfs
generator to systemd, similar to the existing GPT generator: it would
look at the subvolumes of the root btrfs fs, and then try to mount
stuff it finds if it follows a certain naming scheme.
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/M756KVDNY65VONU3GA5CSXB4LBJD3ZIW/


All I am asking for is to make this simple and robust and forward
looking enough so that we can later add something like the generator I
proposed without having to rearrange anything, i.e. make the most basic
stuff self-describing now, even if the automatic discovering/mounting
of other subvols doesn't happen today, or even automatic snapshotting.

By doing that correctly now, you can easily extend things later
incrementally without breaking stuff, just by *adding* stuff. And you
gain immediate compat with "systemd-nspawn --image=" right away as the
basic minimum, which already is great.
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/JB2PMFPPRS4YII3Q4BMHW3V33DM2MT44/


We manage to name RPMs with versions, epochs, archs and so on; I doubt
we need much more for naming subvolumes to auto-assemble.
https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/message/VBVFQOG5EYI73CGFVCLMGX72IZUCQEYG/
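In that spirit, a name like "root-x86-64:fedora_37.1" from earlier in the thread is trivially machine-parseable. A hypothetical parser for such a <type>[:<name>_<version>] convention (the function and the exact split rules are assumptions, not part of any spec):

```python
def parse_subvol(name: str):
    """Split a sub-volume name of the hypothetical form
    <type>[:<name>_<version>] into its parts."""
    type_, _, rest = name.partition(":")
    if not rest:
        return (type_, None, None)  # plain type, e.g. "home"
    # Use the last underscore as the name/version separator, since
    # OS names themselves might contain underscores.
    os_name, _, version = rest.rpartition("_")
    return (type_, os_name or None, version or None)
```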


--
Chris Murphy