Re: [systemd-devel] Reducing unmount/mount of partitions on soft-reboot

2024-03-14 Thread Lennart Poettering
On Mi, 13.03.24 16:57, Aditya Gupta (adit...@linux.ibm.com) wrote:

> Hello,
>
> I tried systemd-soft-reboot on a RHEL system, and it's amazing in terms
> of its ability to do a userspace reboot, within a fraction of the time of a
> full system reboot. For example, for a Power system taking around 50
> seconds to do a normal reboot, it took around 4-5 seconds for a
> systemd-soft-reboot.
>
> I have a question on further optimisation. After soft-reboot, I notice
> much of the time is taken up by .device and .mount services. This was my
> observation based on 'systemd-analyze blame'. Please do let me know if
> I am seeing the wrong numbers, or if there's a better way to know.
>
> Is there some way to 'pass-through' these mounts ? That is, I might not
> need to unmount and remount my boot/root partitions.

Bind mount the relevant mounts from the current system into
/run/nextroot/ if you are using that.

If you are not using /run/nextroot/ then you can also define the mount
via a .mount unit (rather than letting it be auto-generated via /etc/fstab
+ systemd-fstab-generator), and then set DefaultDependencies=no in it,
so that it does not get an implicit Conflicts= dependency on umount.target.

This is briefly documented on the systemd-soft-reboot.service man page btw.
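
A minimal sketch of the second approach, assuming a hypothetical ext4
partition labelled "data" that is mounted at /data (the label, path and
description are illustrative and not taken from the question):

```ini
# /etc/systemd/system/data.mount
# Hypothetical example; the unit file name has to match the Where= path.
[Unit]
Description=Data partition that stays mounted across soft-reboot (illustrative)
# Drop the implicit dependencies, including Conflicts=umount.target,
# so the mount is not stopped when umount.target is pulled in.
DefaultDependencies=no

[Mount]
What=/dev/disk/by-label/data
Where=/data
Type=ext4

[Install]
WantedBy=local-fs.target
```

Note that DefaultDependencies=no also drops the usual ordering against
local-fs-pre.target and local-fs.target, so any ordering that is still
needed has to be added explicitly, and the matching /etc/fstab line should
be removed so the generator does not create a competing unit.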

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to install libudev from source?

2024-03-07 Thread Lennart Poettering
On Do, 07.03.24 17:09, Vru Inbvi (vru.in...@gmail.com) wrote:

> Hi,
>
> I am struggling to install libudev from source (with Ubuntu)
> Can someone please explain what the correct way to do this is, or point me
> to relevant/updated documentation?

https://systemd.io/HACKING

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Query on sshd.socket sshd.service approaches

2024-03-06 Thread Lennart Poettering
On Mi, 06.03.24 13:06, Arseny Maslennikov (a...@cs.msu.ru) wrote:

> > The question of course is how many SSH instances you serve every
> > minute. My educated guess is that most SSH installations have a use
> > pattern that's more on the "sporadic use" side of things. There are
> > certainly heavy use scenarios though (e.g. let's say you are github
> > and serve git via sshd).
>
> A more relevant source of problems here IMO is not the "fair use"
> pattern, but the misuse pattern.
>
> The per-connection template unit mode, unfortunately, is really unfit
> for any machine with ssh daemons exposed to the IPv4 internet: within
> several months of operation such a machine starts getting at least 3-5
> unauthed connections a second from hierarchically and geographically
> distributed sources. Those clients are probing for vulnerabilities and
> dictionary passwords, they are doomed to never be authenticated on a
> reasonable system, so this is junk traffic at the end of the day.
>
> If sshd is deployed the classic way (№1 or №3), each junk connection is
> accepted and possibly rate-limited by the sshd program itself, and the
> pid1-manager's state is unaffected. Units are only created for
> authorized connections via PAM hooks in the "session stack";
> same goes for other accounting entities and resources.
> If sshd is deployed the per-connection unit way (№2), each junk connection
> will fiddle with system manager state, IOW make the machine create and
> immediately destroy a unit: fork-exec, accounting and sandboxing setup
> costs, etc. If the instance units for junk connections are not
> automatically collected (e. g. via `CollectMode=inactive-or-failed`
> property), this leads to unlimited memory use for pid1 on an unattended
> machine (really bad), powered by external actors.

Well, whatever sshd does as ratelimiting systemd can do too,
afaics. I.e. the sshd@.service definitions that we suggest and that the
big distros use all get the ExecStart=- thing right, so that an
unclean exit of sshd does not result in a pinned unit. Moreover, there's
PollLimitIntervalSec=/PollLimitBurst=, MaxConnectionsPerSource=,
MaxConnections= which ensure that any attempt to flood the socket
is reasonably contained, and the system recovers from that.

Current versions of systemd enable these settings by default, hence I
think we actually should be fine by default, even if you do not tune
these .socket parameters.
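
For those who do want to tune them, a sketch of a per-connection
sshd.socket with the settings named above spelled out. The numbers are
illustrative only, not recommended or default values, and
PollLimitIntervalSec=/PollLimitBurst= assume a systemd version recent
enough to know them:

```ini
# sshd.socket (illustrative values)
[Socket]
ListenStream=22
# One service instance per incoming connection
Accept=yes
# Cap on concurrently served connections, in total and per source address
MaxConnections=64
MaxConnectionsPerSource=8
# Limit how often activity on the listening socket is processed
PollLimitIntervalSec=2s
PollLimitBurst=150

[Install]
WantedBy=sockets.target
```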

> > I'd suggest to distros to default to mode
> > 2, and alternatively support mode 3 if possible (and mode 1 if they
> > don't want to patch the support for mode 3 in)
>
> So mode 2 only really makes sense for deployments which are only ever
> accessible from intranets with little junk traffic.

What precisely do you think is missing in systemd that
PollLimitIntervalSec=/PollLimitBurst=, MaxConnectionsPerSource=,
MaxConnections= can't cover?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Query on sshd.socket sshd.service approaches

2024-03-06 Thread Lennart Poettering
On Mi, 06.03.24 14:44, Shreenidhi Shedi (shreenidhi.sh...@broadcom.com) wrote:

> > Lennart Poettering, Berlin
>
> Thanks a lot for the responses Andrei, Poettering .
> We took it from blfs in PhotonOS.
> https://www.linuxfromscratch.org/blfs/view/11.3-systemd/introduction/systemd-units.html
> We need to do some more work on these unit files.

But that tarball actually contains a correct sshd -i line that
includes the "-" that causes the return value to be ignored as it
should.  Hence if your distro didn't do this even though it imported
this from LFS, then it's your distro that broke that...
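
For reference, a sketch of what such a per-connection template typically
looks like. The sshd path and the description are assumptions; the
relevant part is the "-" in front of the command:

```ini
# sshd@.service (per-connection template, instantiated by an Accept=yes socket)
[Unit]
Description=OpenSSH per-connection server daemon

[Service]
# The leading "-" makes a non-zero exit status count as success, so a
# dropped or rejected connection does not leave a failed instance behind.
ExecStart=-/usr/sbin/sshd -i
# The accepted connection is passed to sshd on stdin, inetd-style
StandardInput=socket
```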

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Query on sshd.socket sshd.service approaches

2024-03-06 Thread Lennart Poettering
On Mi, 06.03.24 11:11, Shreenidhi Shedi (shreenidhi.sh...@broadcom.com) wrote:

> Hi All,
>
> What is the rationale behind using sshd.socket other than not keeping sshd
> daemon running always and reducing memory consumption?

Note that there are two distinct modes to running sshd via socket
activation: the per-connection mode (using sshd's native inetd mode),
where there's a separate instance forked off by systemd for each
connection, and a mode where systemd just binds the socket, but
it's served by a single instance. The latter is only supported via an
out-of-tree patch afaik though, which at least debian/ubuntu ship:

https://salsa.debian.org/ssh-team/openssh/-/commit/7fa10262be3c7d9fd2fca9c9710ac4ef3f788b08

Unless you have a gazillion of connections coming in every second I'd
probably just use the per-connection inetd mode, simply because it's
supported upstream. Would be great of course if openssh would just add
support for the single-instance mode in upstream too, but as I
understand ssh upstream is a bit special, and doesn't want to play
ball on this.

To summarize the benefits of each mode:

1. Traditional mode (i.e. no socket activation)
   + connections are served immediately, minimal latency during
     connection setup
   - takes up resources all the time, even if not used

2. Per-connection socket activation mode
   + takes up almost no resources when not used
   + zero state shared between connections
   + robust updates: socket stays connectible throughout updates
   + robust towards failures in sshd: the bad instance dies, but sshd
     stays connectible in general
   + resource accounting/enforcement separate for each connection
   - slightly bigger latency for each connection coming in
   - slightly more resources being used if many connections are
     established in parallel, since each will get a whole sshd
     instance of its own.

3. Single-instance socket activation mode
   + takes up almost no resources when not used
   + robust updates: socket stays connectible throughout updates

> With sshd.socket, systemd does a fork/exec on each connection which is
> expensive and with the sshd.service approach server will just connect with
> the client which is less expensive and faster compared to
> sshd.socket.

The question of course is how many SSH instances you serve every
minute. My educated guess is that most SSH installations have a use
pattern that's more on the "sporadic use" side of things. There are
certainly heavy use scenarios though (e.g. let's say you are github
and serve git via sshd). I'd suggest that distros default to mode
2, and alternatively support mode 3 if possible (and mode 1 if they
don't want to patch the support for mode 3 in).

> And if there are issues in unit files like in
> https://github.com/systemd/systemd/issues/29897 it will make the system
> unusable.

Did any distro ship a unit file like that? That was clearly a buggy
(local?) unit file, I am not aware of any big distro shipping such a
unit file.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Can I provide separate enabling for dbus-activation and "normal" start ?

2024-02-23 Thread Lennart Poettering
On Do, 22.02.24 17:09, Max Gautier (m...@max.gautier.name) wrote:

> Is it possible when writing a dbus-activable service to provide two
> separate and independent ways to enable it ?
>
> The D-Bus service file would for instance be:
> [D-BUS Service]
> Name=org.freedesktop.Notifications
> Exec=notification-daemon
> SystemdService=dbus-org.freedesktop.Notifications.service
>
> The systemd service:
> [Unit]
> PartOf=graphical-session.target
> After=graphical-session.target
>
> [Service]
> Type=dbus
> BusName=org.freedesktop.Notifications
> ExecStart=notification-daemon
>
> [Install]
> Alias=dbus-org.freedesktop.Notifications.service
> WantedBy=graphical-session.target
>
>
> With that systemd service file, `systemctl enable` would cause the
> service to be started by graphical-session.target and by
> dbus-activation; but is it possible to have two separate enable
> commands, one which would enable the dbus activation, one the
> graphical-session start ?
>
> I suppose I should have two separate unit files but I'm not completely
> sure how to do that without copying the whole file (i.e, is there some
> Install/Unit relation I can use for that ?)

No, in systemd there's only one "systemctl enable" and it applies the
[Install] section of the unit file, and that's really all there is.

You can probably add two unit files and use Alias= so that they pick a
common name as alias.

But one unit cannot have two distinct [Install] sections, if that's
what you are looking for.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)

2024-02-20 Thread Lennart Poettering
On Di, 20.02.24 10:24, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> Thanks, I will check this. It sounds like optee needs a similar dependency
> generator.
>
> I wonder how many kernel subsystems/drivers which need userspace daemons
> would need systemd side dependency generators. Is it only the ones inside
> initramfs and/or pre-rootfs mount which need special handling?

Well, systemd to a large part is about getting deps in order,
i.e. start things in the right order but still as parallelized as
possible to make sure we can boot properly, fast.

For regular (i.e. late boot) services things are easier, since we can
hide various deps via socket activation and services typically just
have fewer deps, but during early boot things always require careful
consideration on what you need to schedule when. That's hardly
surprising, is it?

TPM stuff in particular is stuff that we want to make use of super
early, because it's inherently part of the boot process to measure
progress and resources we are using. It's what "Measured Boot" after
all means. And that means you need to know what you do, and can't
really escape that.

> In the end the logic is quite straight forward. If kernel side support is
> there, then a daemon needs to be started before user service start, but
> boot should continue without if kernel support is not detected.

systemd generators are our way to allow dynamic extension of the
systemd unit dependency graph. It's the fact that you want things
dynamic (i.e. responsive to the fact whether your system has a
specific kind of tpm device/secure enclave) that means you have to do
this with a generator.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)

2024-02-19 Thread Lennart Poettering
On Mo, 19.02.24 10:36, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> > After=dev-tpmrm0.device tee-supplicant@teepriv0.service
> > Wants=dev-tpmrm0.device tee-supplicant@teepriv0.service
>
> I think my problems come from:
>
> After=tee-supplicant@teepriv0.service
> Wants=tee-supplicant@teepriv0.service
>
> Basically tee-supplicant should only be started if /dev/teepriv* device node
> is available. Then in my case with fTPM devices, all TPM using and encrypted
> rootfs creating services need to depend on the service which starts
> tee-supplicant but only if /dev/teepriv0 exists. If teepriv0 doesn't exist,
> then tee-supplicant should not be started and the dependencies to it should
> not exist either.

Is /dev/teepriv* guaranteed to be available when userspace is invoked?
Or is it something that itself requires some kmod loading to show up,
i.e. something that "udevadm trigger" causes to be loaded?

> How should this dependency be expressed in systemd services?
>
> Can tee-supplicant@.service include:
>
> Before=systemd-pcrphase-initrd.service systemd-pcrphase.service systemd-pcrmachine.service
> WantedBy=systemd-pcrphase-initrd.service systemd-pcrphase.service systemd-pcrmachine.service
>
> In my testing this does not seem to work inside initramfs.
>
> If systemd-pcrphase-initrd.service, systemd-pcrphase.service and
> systemd-pcrmachine.service have After= and Wants= to
> tee-supplicant@teepriv0.service then things work, except on boards which
> have no optee and no /dev/teepriv0, where tee-supplicant seems to be started
> and fails due to missing optee, which breaks the initramfs boot.

For your usecase the new tpm2.target available in git main is what you
really should focus on: all TPM using services should order themselves
after that. All stuff needed to make a TPM device appear should be
placed before that.

The systemd-tpm2-generator that now exists in git main analyzes the
uefi/acpi firmware situation and automatically adds a dev-tpmrm0.device
dependency to tpm2.target if it comes to the conclusion that such a
device will show up. This generator is not going to cover your
specific case, but I think it would be a good blueprint for you:
i.e. write a generator that checks if /dev/teepriv* exists. If not,
just exit. If yes, generate the required deps to pull in
tee-supplicant@.service, and add the dev-tpmrm0.device dep just like
systemd-tpm2-generator does.
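
As a rough sketch of the output side only: when /dev/teepriv* is present,
such a generator could write a drop-in like the following into the
generator output directory it receives as its first argument. The file
name and the teepriv0 instance are assumptions based on the unit names
quoted above, and this is not what systemd-tpm2-generator itself emits:

```ini
# <generator-dir>/tpm2.target.d/50-tee-supplicant.conf (hypothetical file name)
[Unit]
# Pull in the supplicant and the TPM device, and order tpm2.target after both,
# so that everything ordered after tpm2.target only runs once the TPM is usable.
Wants=tee-supplicant@teepriv0.service dev-tpmrm0.device
After=tee-supplicant@teepriv0.service dev-tpmrm0.device
```

On boards without /dev/teepriv* the generator would write nothing, so no
dependency on tee-supplicant exists there.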

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)

2024-02-19 Thread Lennart Poettering
On Fr, 16.02.24 11:28, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> Support for fTPM devices is problematic. First, the kernel support must be
> modules but loading needs to be specially handled after starting
> tee-supplicant. For normal boot udev handles optee detection and triggers
> tee-supplicant@teepriv0.service startup which unloads tpm_ftpm_tee kernel
> module, starts tee-supplicant and then loads the kernel module again. After
> this RPMB works. To do the same in initramfs, I added Wants: and After:
> dependencies from systemd-repart.service, systemd-cryptsetup@.service,
> systemd-pcrmachine.service and systemd-pcrphase-initrd.service:

Kernel module unloading is not supposed to happen in clean
codepaths. It's a debug/development feature, it's not safe to do as
part of regular boot.

But why do you need to unload a kernel module at all? That smells...

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Handle device node timeout?

2024-02-19 Thread Lennart Poettering
On Di, 16.01.24 16:06, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> Hi,
>
> I have services which depend on a specific device node. How can I run
> some recovery actions when the default 90s timeout for finding this
> device is hit?
>
> OnFailure= doesn't work as the service is not even started.
>
> Specifically the case is about supporting TPM2 encrypted rootfs but falling
> back to plain-text rootfs generation if there is no TPM2 device. Currently
> my initramfs works with TPM2 but without it fails with:

In git main there's new infra to deal with this case:

https://github.com/systemd/systemd/pull/30194

That should hopefully solve this systematically and generically.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] logind: Activating session/opening seat fails in systemd v254

2024-02-16 Thread Lennart Poettering
On Do, 15.02.24 22:16, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hi everyone,
>
> I am working on a kiosk-type device which is supposed to start a
> weston instance upon boot.
> Our images were previously based on Debian 12 and Fedora 38, now we
> are working on unifying them. Between the two old image variants the
> systemd units were mostly identical, however, on Fedora 39 with
> systemd 254 they no longer work. Weston/libseat now fails with the
> message: "Could not activate session: Permission denied". (Also see
> the logind logs at the end).

Neither Weston nor libseat (whatever that is) are a systemd
thing. Please contact the relevant projects for help?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Scan all USB devices from Linux service

2024-02-14 Thread Lennart Poettering
On Mi, 14.02.24 20:24, Muni Sekhar (munisekhar...@gmail.com) wrote:

> Hi all,
>
> USB devices can have multiple interfaces (functional units) that serve
> different purposes (e.g., data transfer, control, audio, etc.).
>
> Each interface can have an associated string descriptor (referred to
> as iInterface). The string descriptor provides a human-readable name
> or description for the interface.
>
> From a user space service utility, how can I scan all the USB devices
> connected to the system, read each interface string
> descriptor (iInterface), and check whether it matches "Particular
> String" or not?

You can use sd-device.h: allocate an enumerator with
sd_device_enumerator_new(), apply some filter via
sd_device_enumerator_add_match_sysattr(), and then enumerate through it via
sd_device_enumerator_get_device_first()/sd_device_enumerator_get_device_next().

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Issue with systemd-logind

2024-02-14 Thread Lennart Poettering
On Mi, 14.02.24 15:03, Akshaya Maran (akshayamara...@gmail.com) wrote:

> Hi,
>
> I am trying to run weston11.0.1 using systemd logind launcher but got
> this error
> " logind: failed to get session seat
> logind: cannot setup systemd-logind helper error:"

This looks like an error message from some weston thing. Please ask
that community for help.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] ConditionNeedsUpdate, read-only /usr, and sysext

2024-02-14 Thread Lennart Poettering
On Mi, 07.02.24 20:42, Valentin David (m...@valentindavid.com) wrote:

> Hello everybody,
>
> The behavior of ConditionNeedsUpdate is that if /etc/.updated is
> older than /usr/, then it is true.
>
> I have some issues with this. But maybe I do not use it the right
> way.
>
> First, when using a read-only /usr partition (updated through
> sysupdate), the time of /usr is of the build of that filesystem. In
> the case of GNOME OS, to ensure reproducibility bit by bit, we set
> all times to some time in 2011. So that does not work for us.

Hmm, I wonder if the os-release file in /usr/ should optionally have a
timestamp field which could be used. That could be directly
initialized from $SOURCE_DATE_EPOCH at build time (maybe the field
should even be named like that). I think that would make sense, no?

> But now let's say we work-around that, and we make our system take a
> date that is reproducible, let's say the git commit of our
> metadata. Then we have a second issue.
>
> Because of systemd-sysext, it might be that /usr is not anymore the
> time of the /usr filesystem, but the time of a directory created on
> the fly by systemd-sysext (or maybe it keeps the time from the /
> filesystem, I do not know, but for sure the time stamp is from when
> systemd-sysext was started). If systemd-update-done happens after
> systemd-sysext (and it effectively does on 254), then the date of
> /etc/.updated will become the time when systemd-sysext started.

Uh. That'd be a bug. Can you file an issue about this?

> Let's imagine that I do not boot that machine often. My system is
> booting a new version. And there is already another new version
> available on the sysupdate server. My system will download a build
> of /usr that is likely to be older than the boot time. So next
> reboot, the condition will be false, even though I did have an
> update. And it will be false until I download a version that was
> built after the boot time of my last successful update.
>
> So my question is, is there plan to replace time stamp comparison
> for ConditionNeedsUpdate with something that works better with
> sysupdate and sysext? Maybe copying IMAGE_VERSION from
> /usr/lib/os-release into /etc/.updated for example?

Yeah, we should fix this.

I have so far never thought about the mixture of sysext and
ConditionNeedsUpdate=. This is uncharted territory. But I think we
can fix this. But please open issues about this.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] What creates a new machine-id ?

2024-02-08 Thread Lennart Poettering
On Do, 08.02.24 09:35, Agrain Patrick (patrick.agr...@al-enterprise.com) wrote:

> Hello,
>
> Our embedded system is based on a Rocky Linux 8 distribution which embeds 
> systemd-239.
>
> At first bootup, a machine-id is created and remains persistent over the 
> following reboots.
> System upgrade sometimes creates a new machine-id, sometimes not.
> By 'system upgrade', I mean either new linux kernel or upgraded Rocky 
> packages or both.
>
> Could you tell me precisely what event(s) in the previous upgrade cases
> trigger a new machine-id ?

See:

https://www.freedesktop.org/software/systemd/man/latest/machine-id.html#Initialization

Or in other words: the machine ID is supposed to be persisted in
/etc/. If your upgrade procedure somehow causes the machine ID to be
invalidated, then we'll assign a new one though. We basically
make sure that whatever happens, on boot we initialize it.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Detecting Systemd crash

2024-02-05 Thread Lennart Poettering
On Sa, 03.02.24 16:55, Álvaro Cebrián Juan (acebrianj...@gmail.com) wrote:

> Great question!
>
> I am very interested in detecting systemd crashes too since I have
> experienced them recently and have been asked to come up with a solution to
> react when a PID1 crash happens.
> In fact, in my recent experiences, a journald crash was enough to render
> the system into an unreliable/degraded state in which some top-level
> applications worked while others didn't.
>
> So adding to David's 1st question, I need to detect systemd and journald
> crashes and then trigger a `systemctl reboot --force --force`
> command

As mentioned elsewhere in this thread just use RuntimeWatchdogSec= in
systemd-system.conf(5)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Detecting Systemd crash

2024-02-05 Thread Lennart Poettering
On Mo, 05.02.24 13:54, Lennart Poettering (lenn...@poettering.net) wrote:

> you can just use the usual hw watchdog. If pid1 dies it will not ping
> the hw watchdog, and thus a reset is triggered automatically. In fact
> we actually configure the hw watchdog by default these days on hw that
> has it (which are most PCs).

Actually, we don't really, I need to correct myself. We probably
should though, dunno.

See RuntimeWatchdogSec= in systemd-system.conf(5)
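
A minimal sketch of such a configuration, assuming the machine has a
hardware watchdog that the kernel exposes. The drop-in file name and the
30s timeout are illustrative:

```ini
# /etc/systemd/system.conf.d/10-watchdog.conf (hypothetical file name)
[Manager]
# PID 1 programs the hardware watchdog with this timeout and keeps pinging
# it; if PID 1 stops doing so, the hardware resets the machine.
RuntimeWatchdogSec=30s
```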

>
> > 2: How do I get Systemd to freeze to test such program? I mean, if I kill
> > Systemd, the kernel would crash so I have to somehow tell Systemd to freeze?
>
> Not really, the kernel blocks SIGSTOP for PID1.
>
> Lennart
>
> --
> Lennart Poettering, Berlin

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Detecting Systemd crash

2024-02-05 Thread Lennart Poettering
On So, 04.02.24 00:06, David Timber (d...@dev.snart.me) wrote:

> Systemd crashed on me the other day. I was writing up some Systemd units and
> testing them out by daemon-reload every time I wanted to test them out. Not
> the best way to go on about, I know. My bad abusing Systemd to the point of
> crashing. Perhaps it was just a bit flip that caused this.
>
>systemd[2368]: Assertion 'path_is_absolute(p)' failed at
>src/basic/chase.c:628, function chase(). Aborting.
>systemd[1]: Assertion 'path_is_absolute(p)' failed at
>src/basic/chase.c:628, function chase(). Aborting.
>systemd[1]: Caught  from our own process.
>systemd-coredump[32497]: Due to PID 1 having crashed coredump
>collection will now be turned off.
>systemd-coredump[32497]: [] Process 32496 (systemd) of user 0
>dumped core.
>systemd[1]: Caught , dumped core as pid 32496.
>systemd[1]: Freezing execution.
>
>...
>
>systemd-journald[871]: Failed to send stream file descriptor to
>service manager: Transport endpoint is not connected
>
> I didn't even bother trying producing stack trace. I can get on that if
> anyone wants it. My machine started doing some weird things like
> Firefox not

If this is a current systemd version (v255), please generate a stack trace
and submit it as github issue to us, we'll look into it. If it's
older, please report to your distro first.

> being able to do Ajax properly whilst being able to go to a new page,
> Chromium not being able to create a new tab whilst all the text editors
> worked just fine, all the systemctl commands timing out. So basically, I was
> using Linux without fork(). Anyway.
> Well, I think any software can crash for any reason whatsoever. The
> problem

Yeah, an assert like the above is an error we need to fix in systemd.

> with Systemd I realised from this incident is that I had no way of knowing
> that Systemd had crashed until I opened up the journal and kernel logs and
> saw that Systemd had crashed some time ago. In this particular incident,
> Systemd caught the signal and decided to just freeze. No idea why you'd want
> that because if it had just crashed, the kernel would have just panicked and
> I would have realised something went wrong.
>
> 1: So I decided that I need a some sort of "watchdog" that warns me when
> something like this happens. Using dbus to poll the status of the Systemd
> process, it could be a GUI app running under a seat, just a daemon that
> writes a warning message using `wall` or just send mail using a primed up
> MUA process. I wonder if someone already had the same idea and went on to
> make one.

you can just use the usual hw watchdog. If pid1 dies it will not ping
the hw watchdog, and thus a reset is triggered automatically. In fact
we actually configure the hw watchdog by default these days on hw that
has it (which are most PCs).

> 2: How do I get Systemd to freeze to test such program? I mean, if I kill
> Systemd, the kernel would crash so I have to somehow tell Systemd to freeze?

Not really, the kernel blocks SIGSTOP for PID1.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-pcrlock Failed to submit super PCR policy

2024-02-05 Thread Lennart Poettering
On Mo, 05.02.24 09:24, Dominick Grift (dominick.gr...@defensec.nl) wrote:

Please run "SYSTEMD_LOG_LEVEL=debug systemd-pcrlock make-policy" from
the command line, then file a github issue about this, and paste the
output there.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Systemd units complains about cgroup with 5.15.x kernel

2024-02-01 Thread Lennart Poettering
On Do, 01.02.24 16:30, Thierry Bultel (thierry.bul...@linatsea.fr) wrote:

> Hi,
>
> I am using systemd v255,
> and currently using a kernel vendor branch :
>
> g...@github.com:varigit/linux-imx.git
> lf-5.15.y_var01
> imx_v7_defconfig
>
> I had no issue with the older 5.4 kernel.
>
> I have verified that the kernel has the following options:
>
> CONFIG_DEVTMPFS=y
> CONFIG_CGROUPS=y
> CONFIG_INOTIFY_USER=y
> CONFIG_SIGNALFD=y
> CONFIG_TIMERFD=y
> CONFIG_EPOLL=y
> CONFIG_UNIX=y
> CONFIG_SYSFS=y
> CONFIG_PROC_FS=y
> CONFIG_FHANDLE=y
>
> CONFIG_NET_NS=y
>
> CONFIG_SYSFS_DEPRECATED is not set
>
> CONFIG_AUTOFS_FS=y
> CONFIG_AUTOFS4_FS=y
> CONFIG_TMPFS_POSIX_ACL=y
> CONFIG_TMPFS_XATTR=y
>
> --->
>
> systemd is failing to start some units:
>
> systemd[1]: wpa_supplicant.service: Failed to create cgroup
> /system.slice/wpa_supplicant.service: No such file or directory
> and also;
>  (agetty)[217]: serial-getty@ttymxc0.service: Failed to attach to cgroup
> /system.slice/system-serial\x2dgetty.slice/serial-getty@ttymxc0.service: No
> medium found
>
> ... and I do not have a serial console.
>
> I am currently digging into systemd code to find out what is possibly wrong
> .. but if anyone gets a clue, I would appreciate !

Educated guess: you have no cgroup v2 support or so?

It would make sense to provide logs, or to use strace to check what
precisely fails.

Ask your distro for help?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Delaying VM startup until block devices are available

2024-01-26 Thread Lennart Poettering
On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:

> We have various VMs that are backed by luks encrypted LVs.  At boot the volumes
> are decrypted by clevis.  The problem we are seeing at the moment is that the
> VMs are started before the block devices are decrypted.  Our current
> solution is:

We generally wait for all devices listed in /etc/crypttab, unless you
set noauto or nofail.

>
> # cat /etc/systemd/system/virtqemud.service.d/override.conf
> [Unit]
> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>
> Where we list each of the volumes to be decrypted as blocking the virtqemud
> service.
>
> Does anyone have any better alternatives?  My main issue is that it feels
> somewhere in between fine-grained and coarse-grained control.
>
> Ideally I think one would be able to have each individual VM startup
> automatically delayed until the devices each used became available, but I
> don't see how to do this.

I am not sure how libvirt works, but if it runs every VM in a systemd
unit, then you could just order the device before that unit, or the
unit after the device.

Really depends on how libvirt splits things up.
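
Purely as an illustration of the ordering idea: if a VM were run from a
unit of its own (the unit name vm-01.service below is hypothetical, and
libvirt may not split things up this way at all), a drop-in could tie it
to the decrypted device. The device unit name is the escaped form of
/dev/mapper/luks-vm-01-disk0, taken from the volume name in the question:

```ini
# /etc/systemd/system/vm-01.service.d/10-disk.conf (hypothetical unit and file name)
[Unit]
# Start this VM only once its decrypted volume has shown up
Requires=dev-mapper-luks\x2dvm\x2d01\x2ddisk0.device
After=dev-mapper-luks\x2dvm\x2d01\x2ddisk0.device
```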

> Alternatively it seems like one should be able to delay all VM startup until
> all volumes in /etc/crypttab were unlocked, rather than having to specify each
> one.  But I don't see a target for that.

This is default behaviour. Anything listed in /etc/crypttab is ordered
before cryptsetup.target, which is ordered before sysinit.target,
which is ordered before basic.target, which is ordered before regular services.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services

2024-01-26 Thread Lennart Poettering
On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:

> Hi.
>
> I sent below mail some week ago, Barry's reply left me unsure as to
> whether this would be a bug or not.  I still tend do assume that I'm
> "doing something wrong".

This is currently not supported. The filters are communicated by the
service manager to journald via xattrs on the cgroups, and journald
will only consider those for cgroups owned by root, i.e. not on
cgroups delegated to unpriv users, as is done for systemd --user
instances.
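
For comparison, a sketch of where the setting does take effect, i.e. in a
unit run by the system manager. The service name, binary path and pattern
are made up for illustration:

```ini
# /etc/systemd/system/example.service (hypothetical system-level unit)
[Service]
ExecStart=/usr/bin/example-daemon
# A leading "~" turns the pattern into a deny filter: messages whose
# text matches "debug" are discarded by journald.
LogFilterPatterns=~debug
```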

Interpreting arbitrary regexes configured by unpriv code in priv code
comes at some risk, because afair constructing them can come at O(2^n)
time, i.e. a rogue regex could make us consume unbounded time on
processing journal messages.

Hence, I wouldn't hold your breath. Unless someone figures out a smart
way to deal with this it's unlikely to be supported.

We should document this however I guess. Hence if you file an issue
that would be more than welcome, so that we can keep track of this.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Permanently remove services

2024-01-19 Thread Lennart Poettering
On Do, 18.01.24 23:40, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> > > They are turning up as failed units, so they are being run,
> > > even if I don't have any TPM module. Also, I have a notifier in
> > > my waybar telling me of failed services and I don't want to see
> > > them there.
> >
> > Can you provide logs about this? The goal is definitely to make these
> > NOPs on TPM-less systems. I am a bit puzzled that the conditioning
> > they come with is not sufficient. We might need to tweak something
> > there then.
> >
> > The idea is that the system does TPM setup on systems that have a tpm
> > and on systems lacking that silently just skips all these so that
> > everything always works fully automatically and robustly without any
> > ugly error output.
> >
> > hence, any chance you can provide logs about this? and what kind of
> > system is this? i.e. does it really lack a tpm?
>
> In the past I have seen errors on systems which do not have
> libtss2/tpm2-tss installed though I am not sure if those should be
> silenced. After all, the unit being enabled means that one wants to
> use it if possible - and if the libraries are missing that should be
> noticeable to the user instead of a silent fail.

No, the libs are installed, that's what the "systemd-creds has-tpm2"
output shows.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Permanently remove services

2024-01-18 Thread Lennart Poettering
On Do, 18.01.24 22:53, Morten Bo Johansen (morte...@hotmail.com) wrote:

> ~/ % systemd-creds has-tpm2
> partial
> +firmware
> -driver
> +system
> +subsystem
> +libraries

OK, so this indicates that your system has TPM support on all levels
with a single exception: you lack an actual linux driver for your
specific hw. And that puzzles me, because to my knowledge at least
linux should support all relevant tpm2 interfaces just fine. This
suggests that you haven't got the right modules installed.

i don't know arch but is there possibly some extra package you have to
install to get more drivers?

tpm2 drivers are super basic stuff, it sounds really weird to me to
split this out. It's a condition this stuff indeed is not prepared for
though: that everything is set up properly, from firmware to kernel to
userspace, but the driver is not actually available.

> The output from journalctl --unit systemd-tpm2-setup-early.service:
>
> -- Boot b3fca98d73f6441590174a72ac0d27fa --
> jan 18 18:13:02 gatsby systemd-tpm2-setup[329]: Failed to create TPM2 context: State not recoverable
> jan 18 18:13:02 gatsby systemd-tpm2-setup[329]: ERROR:tcti:src/tss2-tcti/tcti-device.c:451:Tss2_Tcti_Device_Init() Failed to open specified TCTI device file /dev/tpmrm0: No such file or direc>
> jan 18 18:13:03 gatsby systemd[1]: systemd-tpm2-setup-early.service: Main process exited, code=exited, status=1/FAILURE
> jan 18 18:13:03 gatsby systemd[1]: systemd-tpm2-setup-early.service: Failed with result 'exit-code'.
> jan 18 18:13:03 gatsby systemd[1]: Failed to start TPM2 SRK Setup (Early).
>
> There is a /dev/tpm0 file but not a /dev/tpmrm0 file

Oh, interesting. Is it possible that your system has only a TPM 1.2
device? (maybe your bios allows switching between TPM 2.0 and 1.2 modes)

It could be that we simply misdetect the tpm 1.2 case, i admittedly
never tested things on such a system. how old is that PC?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Permanently remove services

2024-01-18 Thread Lennart Poettering
On Do, 18.01.24 22:26, Morten Bo Johansen (morte...@hotmail.com) wrote:

> On 2024-01-18 Lennart Poettering wrote:
>
> > hence, any chance you can provide logs about this? and what kind of
> > system is this? i.e. does it really lack a tpm?
>
> I shall try to accommodate you. How do I get the log?
>
> The command "systemctl --plain --no-legend list-units --state=failed"
> does not provide enough info.

ideally boot with "systemd.log_level=debug" on the kernel cmdline, and
then paste "journalctl -b" somewhere.

The full output of "systemd-creds has-tpm2" would be good too.

> I have no external TPM module installed and I don't think my
> rather old cpu, "Intel(R) Core(TM) i5-4570T CPU @ 2.90GHz", has
> any on-board TPM2 capability?

That sounds fairly recent, so I would assume that your machine has a
TPM.

Which OS is this? Is it possible that your kernel has TPM2 support
enabled, but for some reason the driver for your hw is not available
(for example not included in the initrd)?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Permanently remove services

2024-01-18 Thread Lennart Poettering
On Do, 18.01.24 19:43, Morten Bo Johansen (morte...@hotmail.com) wrote:

> On 2024-01-18 Andy Pieters wrote:
>
> > Not being funny, but why care? They have got a conditional check in them
> > and will only run when it makes sense.
> > So these units will do nothing and won't delay your boot or take up
> > resources
>
> They are turning up as failed units, so they are being run,
> even if I don't have any TPM module. Also, I have a notifier in
> my waybar telling me of failed services and I don't want to see
> them there.

Can you provide logs about this? The goal is definitely to make these
NOPs on TPM-less systems. I am a bit puzzled that the conditioning
they come with is not sufficient. We might need to tweak something
there then.

The idea is that the system does TPM setup on systems that have a tpm
and on systems lacking that silently just skips all these so that
everything always works fully automatically and robustly without any
ugly error output.

hence, any chance you can provide logs about this? and what kind of
system is this? i.e. does it really lack a tpm?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Activation environment(s)?

2024-01-15 Thread Lennart Poettering
On Fr, 12.01.24 18:16, Vladimir Kudrya (vladimir-...@yandex.ru) wrote:

> On 08/01/2024 22.26, Simon McVittie wrote:
> > It is not possible to unset a variable in the dbus-daemon's activation
> > environment, or with `dbus-update-activation-environment`: that's an
> > API limitation in the org.freedesktop.DBus interface. I've thought about
> > adding an UnsetAndSetActivationEnvironment() that could do this.
> >
> > It *is* possible to unset a variable in the `systemd --user`
> > activation environment, with `systemctl --user unset-environment` or
> > the UnsetEnvironment() and UnsetAndSetEnvironment() D-Bus methods on the
> > systemd manager. If your distribution is using dbus-broker rather than
> > dbus-daemon, and if Mantas was correct to say that dbus-broker does not
> > have its own separate activation environment, then that should be enough
> > to affect all D-Bus session services. It will also affect all other
> > systemd user services.
>
> Thank you. I now recommend dbus-broker in my session manager's readme
> (https://github.com/Vladimir-csp/uwsm), and management of dbus activation
> environment is now conditional on dbus unit true name not being
> dbus-broker.service.
>
> BTW, the whole reason I even decided to interact with dbus is rather
> aggressive session termination by systemd. It seems to send signals not only
> to existing processes in the session, but even to new ones which were
> spawned in response to those signals. This makes it impossible to fork a
> systemctl process to stop related user units.
>
> I solved this by interacting with dbus without spawning new processes, but,
> just for info, is there a proper way to fork something that survives for a
> bit in a session that is being terminated? With simple tools like `trap
> 'something' TERM HUP` in a shell? Or maybe there is some other more native
> way to bind a user level unit to a particular session scope?

When the goal is to shut down a service/session, then we intend to give
guarantees that the shut down time is bounded: we first send SIGTERM,
and start a timeout. If by that timeout there are still processes left
we SIGKILL to put an end to things. If we'd somehow distinguish
new/old processes then we couldn't put the boundary on the shutdown
process...

So no, this does not exist. You can fork if you want, but it won't add
time to the time-out.
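
A minimal sketch of that bounded sequence in Python, for a single
child process only. The real thing acts on every process in the
cgroup, and the timeout there is whatever TimeoutStopSec= is set to.
The function name is made up for illustration.

```python
import subprocess


def stop_bounded(proc, timeout):
    """Send SIGTERM, wait at most `timeout` seconds, then SIGKILL.

    Returns "terminated" if the process exited in time, "killed" otherwise.
    The total time spent here is bounded by `timeout`, no matter what the
    process does in response to SIGTERM.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=timeout)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"
```

Anything the process forks after SIGTERM lives inside the same window.
The clock started when SIGTERM was sent and is not restarted for new
arrivals.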

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Potential systemd CoredumpFilter sandboxing issue

2024-01-10 Thread Lennart Poettering
On Mo, 08.01.24 04:04, daechir (daec...@protonmail.com) wrote:

> Hello again,
> Thanks for fixing the utmp build issue from Nov 2023. I lost the email and 
> couldn't figure out how to write to it.
>
> I found another issue that seems to be a bit more complicated. I'll try to 
> describe it as best I can.
>
> When booting with the kernel parameter coredump_filter=0x0, all
> processes should read coredump_filter (at /proc/*/coredump_filter)
> as , or private-anonymous. This behavior works as
> intended. However, when specifying this kernel parameter, and also
> setting the systemd sandboxing option
> CoredumpFilter=private-anonymous, some services still tend to ignore
> or overwrite this value. I have found with v255 that
> /usr/lib/systemd/systemd --user is one such example, or
> user@.service which sets its /proc/*/coredump_filter to 0001
> instead.

As per kernel docs the kernel command line option only sets the
*default*, i.e. userspace can override it. So the behaviour works as
intended?

Quoting kernel docs:

coredump_filter=
[KNL] Change the default value for
/proc//coredump_filter.
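
For reference, a small Python sketch of how the flag names relate to
the bits in the per-process coredump_filter file. It assumes the bit
layout from the kernel's proc documentation and the flag names
accepted by CoredumpFilter= in systemd.exec(5). The helper names are
made up.

```python
# Flag names in bit order (bit 0 first); layout assumed from the kernel docs.
FLAGS = [
    "private-anonymous",    # bit 0
    "shared-anonymous",     # bit 1
    "private-file-backed",  # bit 2
    "shared-file-backed",   # bit 3
    "elf-headers",          # bit 4
    "private-huge",         # bit 5
    "shared-huge",          # bit 6
    "private-dax",          # bit 7
    "shared-dax",           # bit 8
]


def names_to_mask(names):
    """Combine flag names into the integer written to coredump_filter."""
    mask = 0
    for name in names:
        mask |= 1 << FLAGS.index(name)
    return mask


def mask_to_names(mask):
    """List the flag names whose bits are set in `mask`."""
    return [name for bit, name in enumerate(FLAGS) if mask & (1 << bit)]
```

Under that layout private-anonymous is bit 0, so a filter value of 1
is what CoredumpFilter=private-anonymous would be expected to produce,
while a value of 0 selects nothing at all.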

> Am I wrong in understanding that private-anonymous usually maps to ?
> Also, wouldn't 0001 show something like coredump_filter=0x01 or 
> CoredumpFilter=shared-anonymous?

I cannot parse this.

Lennart

--
Lennart Poettering, Berlin


Re: Can mkosi replace Kickstart / Calamares?

2024-01-02 Thread Lennart Poettering
On Mo, 25.12.23 02:39, Patrick Schleizer (patrick-mailingli...@whonix.org) wrote:

> Hi,
>
> I am maintaining a systemd, Debian-based Linux distribution (Kicksecure) and
> am considering moving to mkosi as the "base image creation tool".
>
> It seems mkosi is a fine OS image builder. With systemd-repart, you even
> solved the resizing of partitions at the first boot, which is magic.
>
> Suppose a Linux distribution is providing an OS image that can be written to
> USB. Maybe soon, even to a CD/DVD. [1]
>
> Suppose that OS image is supposed to be able to act as an installer, so the
> user can use it to install it on an internal hard drive.
>
> Is something like Kickstart or Calamares still required? It seems (at least
> Calamares, whose code I am reading) is kind of "yet another OS image
> builder". It doesn't build an image but instead writes to a hard drive.
> However, I find it problematic that a lot of code (creating partition
> tables, creating file systems, making bootable) is duplicated. [2]

I don't really know what Kickstart/Calamares really do.

But it's certainly our intention to allow systemd-repart to operate
like an installer, in the sense that you boot from a USB stick and you
can use systemd-repart to copy the relevant partitions you just booted
from to a target disk very efficiently, which will then be basically
the same OS, just with maybe differently sized data/home partitions,
new uuids, different crypto keys and such.

More specifically, systemd-repart + bootctl install +
systemd-firstboot is supposed to be enough to do what a classic
installer disk can do on traditional OSes.

Note that currently there are still some gaps, but people are working
on this in various places.

> Do you have any suggestions?
>
> Did you envision replacing installers, or do you already have tools for
> that?

Well, depends on what you mean by "installers". We certainly have no
interest to replace a package-based installer. But we certainly do
want to provide you with basic tools which you can combine into an A/B
image-based OS installer.

> [2] But what about installer questions, customization such as time zone,
> keyboard layout? I think the crucial question for an installer is the target
> drive, and that's it. Perhaps partitioning and file system choices, but that
> is more for geeks. How about time zone, keyboard layout? Valid points. But I
> think those would be better handled through a first-boot GUI wizard.

systemd-firstboot is supposed to be just that – but it only covers the
offline and console cases. It's also supposed to be useful as a blueprint
to implement something similar in a graphical tool.

systemd-firstboot can be used in two modes. In "offline" mode, where
you call it from the cmdline and specify --root= or --image= to let it
operate directly on an OS tree you mounted somewhere or on a block
device/image file you have accessible. Or in "online" mode where it is
run at first boot, and asks the user interactively.

systemd-firstboot covers hostname, locale, keyboard, timezone, root pw
currently. In systemd git main you also find a "homectl firstboot"
command which can prompt the user interactively for a user to create
at boot.

Regarding partitioning: my thinking was that installers would ship
multiple alternative sets of repart .conf files, of which the first
that can be applied is applied or of which the user can pick one
explicitly, depending on the use case. The focus is clearly on
automatic partitioning here though, if people want to manually and
precisely set the sizes of each partition in a UI, then repart is not
the tool they should use.

Lennart

--
Lennart Poettering, Berlin


Re: systemd-sysupdate support for slow rollout (aka A/B testing)

2024-01-02 Thread Lennart Poettering
imited to no interest over precise
> control of updates and user devices and the users wish for anonymity.
> On the other hand though are enterprises which deploy sysupdate for
> (I)IoT devices. In these case devices commonly have to be registered
> anyhow, and the enterprise controls how updates are rolled out etc. In
> these cases anonymity is not necessary and instead customers often pay
> the enterprise to perform all the management on their behalf.

I think adding some concept for this would be entirely fine, but this
really should be opt-in. Happy to review a patch for this.

I think in the longer run we need to hook this up with remote
attestation though, i.e. instead of just including the machine ID,
include a quote from the TPM about PCR 15 (which includes a
measurement of the machine ID), signed by some suitable local TPM
key. That would make it a bit harder for clients that were hacked to
play games with you, and report incorrect machine IDs.

Lennart

--
Lennart Poettering, Berlin


Re: systemd-sysupdate support for slow rollout (aka A/B testing)

2024-01-02 Thread Lennart Poettering
On Di, 02.01.24 13:11, Simon McVittie (s...@collabora.com) wrote:

> Prior art: Debian/Ubuntu apt does slow rollout for packages like
> this, with simple filesystem-based http mirrors combined with "smart"
> clients. It works by adding a Phased-Update-Percentage field to the
> metadata of each package. The client calculates some sort of ID for itself
> (I don't know precisely how), and then takes the upgrade if it finds that
> its ID is in the first x% of the available range.
>
> If I understand correctly, Ubuntu is using this mechanism in production
> but Debian is not.
>
> Using some sort of hash of the machine ID + the proposed version would
> probably have the behaviour you want, of choosing a different x% of
> machines to be the early-adopter set for each update?

Yes, this is what I think would be the right approach.
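
A rough sketch of that idea in Python, just to show the shape of it.
Nothing like this exists in sysupdate today; the function names, the
choice of SHA-256 and the ":" separator are all assumptions.

```python
import hashlib


def rollout_bucket(machine_id, version):
    """Map (machine ID, version) to a bucket in the range 0..99.

    Hashing the version in means a different set of hosts lands in the
    low buckets for each release.
    """
    digest = hashlib.sha256(f"{machine_id}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(machine_id, version, percentage):
    """True if this host belongs to the first `percentage` percent."""
    return rollout_bucket(machine_id, version) < percentage
```

A host that is in the rollout at 3% stays in it at 25%, since the
bucket for a given version never changes, only the threshold does.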

> > This would then mean for the server that it would first serve
> > foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of
> > the hosts would update to it. And then, once you collected enough
> > feedback you'd rename the file to foobar_47.11_25.raw and 25% of the
> > hosts would switch over. Finally you'd set the value to 100 (or maybe
> > just drop it, which should be considered equivalent to 100), and then
> > all remaining hosts would update.
>
> If you're using a hash of the machine ID + the proposed version as
> your randomization, then I think you'd want to have a single image (or
> version ID, or some other unique identifier) for each proposed update, and
> separately, a metadata field that sets *x* in the instruction "if you have
> figured out that you are in the first x% of machines, upgrade". Otherwise,
> publishing foobar_47.11_3.raw followed by foobar_47.11_25.raw would be
> more likely to result in approximately (3% + 25% = 28%) of machines
> upgrading[1], because the client doesn't know that it's actually the
> same update and would "re-roll the dice" for each republished name.

My thinking was that clients would look at multiple entries which only
differ by the percentage (i.e. are identical in name and version) and
drop all of them but the one with the highest percentage, and ignore
all others.
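
Roughly like this, as a Python sketch. The file naming scheme
(name_version_percentage.raw, with a missing percentage meaning 100)
is taken from the example quoted above; the parsing and the function
name are assumptions, and names or versions containing "_" would need
more careful handling.

```python
import re

# Hypothetical pattern: <name>_<version>[_<percentage>].raw
PATTERN = re.compile(
    r"^(?P<name>.+?)_(?P<version>[^_]+?)(?:_(?P<pct>\d{1,3}))?\.raw$"
)


def effective_entries(filenames):
    """Collapse entries that share name and version.

    Returns a dict mapping (name, version) to the highest percentage seen.
    Files that do not match the pattern are skipped.
    """
    best = {}
    for filename in filenames:
        m = PATTERN.match(filename)
        if m is None:
            continue
        pct = int(m.group("pct")) if m.group("pct") is not None else 100
        key = (m.group("name"), m.group("version"))
        best[key] = max(best.get(key, 0), pct)
    return best
```

That way a server that briefly lists both foobar_47.11_3.raw and
foobar_47.11_25.raw is treated as offering 47.11 at 25%, not as two
separate rolls of the dice.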

Lennart

--
Lennart Poettering, Berlin


Re: sysupdate: Limit update to at most one major version

2024-01-02 Thread Lennart Poettering
On Di, 02.01.24 13:49, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> > I'd be fine with adding MaxVersion=. Happy to review a patch, merge
> > something like this (at least file an RFE issue)
>
> Should that be inclusive or exclusive? Naming it MaxVersion would
> imply it to be inclusive though an exclusive bound would likely be
> more useful most of the time. One could then specify MaxVersion=1.3.0
> in their 1.2.x images and once they have an upgrade path they would
> explicitly raise the max version in e.g. 1.2.15. Otherwise they would
> have to specify 1.99.99.
> In retrospect a VersionBound= property with syntax similar to
> ConditionKernelVersion= would have been better though I guess that
> ship has sailed - or is it? Is sd-sysupdate still considered
> experimental? Not sure if this warrants such a change though :shrug:

We do not allow "=" nor "<" in version strings, as per
https://uapi-group.org/specifications/specs/version_format_specification/.

Hence we could use that fact and say that "MaxVersion= <=47.11" or
"MaxVersion= <47.11" can be used to make the type of version
comparison explicit. This would implement a tiny subset of the
ConditionKernelVersion= logic, and simply default to <= if the
comparison is not specified explicitly.

Of course, a similar logic should then be implemented for MinVersion,
i.e. >= and >
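
As a sketch of what the parsing could look like, in Python. MaxVersion=
does not exist yet, so this is purely hypothetical, and the version
comparison here is deliberately simplified to dotted integers; the
real version format specification has more rules than that.

```python
import re

_BOUND = re.compile(r"^\s*(<=|>=|<|>)?\s*(\S+)\s*$")


def parse_bound(value, default_op="<="):
    """Split e.g. '<47.11' into ('<', '47.11').

    A bare version gets `default_op`, which works because version strings
    may not contain '=' or '<'.
    """
    m = _BOUND.match(value)
    if m is None:
        raise ValueError(f"cannot parse version bound: {value!r}")
    return (m.group(1) or default_op), m.group(2)


def version_key(version):
    # Simplification: dotted integers only, e.g. "47.11" -> (47, 11).
    return tuple(int(part) for part in version.split("."))


def satisfies(candidate, bound, default_op="<="):
    """Check a candidate version against a bound such as '<47.11'."""
    op, limit = parse_bound(bound, default_op)
    a, b = version_key(candidate), version_key(limit)
    return {"<": a < b, "<=": a <= b, ">": a > b, ">=": a >= b}[op]
```

For a MinVersion= counterpart the same parser would be called with
default_op=">=".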

> Should we continue this discussion on the mailing list or an issue?

Issue is better.

Lennart

--
Lennart Poettering, Berlin


Re: sysupdate: Limit update to at most one major version

2024-01-02 Thread Lennart Poettering
On So, 31.12.23 14:43, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hello,
>
> we are currently using sd-sysupdate to roll out updates and we're wondering
> if there is any possibility to limit updates to consider at most one next
> major version. This would allow us to write the software to handle only
> data migrations from the previous major version instead of any version
> beforehand.
> The only thing I have been able to find is MinVersion= which seems to do
> exactly the opposite of what we would want to do.

I'd be fine with adding MaxVersion=. Happy to review a patch, merge
something like this (at least file an RFE issue)

Lennart

--
Lennart Poettering, Berlin


Re: systemd-sysupdate support for slow rollout (aka A/B testing)

2024-01-02 Thread Lennart Poettering
On Mi, 20.12.23 19:04, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hey everyone,
>
> does sysupdate currently support any way to slowly roll out updates
> where the server providing the files can be in control? This would be
> used to slowly make a new version available and have it at e.g. 1%
> adoption for a day to monitor regressions before increasing the
> coverage. I was unable to find any information about it in the
> documentation.

This is currently not available, no.

The idea so far was always that the server is dumb, and the client
picks the release it wants.

I have thought about this usecase a while back, and my thinking was
that such a staged update logic should be driven by the machine
ID. i.e. we should teach sysupdate a simple logic that allows pattern
matching of new versions based on some arithmetic of the machine
ID. More specifically, include some value in the URL pattern that
indicates the percentage of hosts that shall update to this
release. Then, each client takes its machine ID, treats it as an
integer and calculates modulo 100 of it or so, and then checks if the
resulting value is below the intended percentage, and if so it
updates, otherwise it doesn't.

(or something like that, the above is probably not ideal, since it
would mean it's always the same hosts that try a new release first,
and it probably should be evened out across the set of clients).
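
The naive variant would look something like this in Python. This is a
sketch only; the function names are made up and none of this is
implemented in sysupdate.

```python
def naive_bucket(machine_id):
    """Treat the 128-bit machine ID (32 hex chars) as an integer, modulo 100."""
    return int(machine_id, 16) % 100


def should_update(machine_id, percentage):
    """True if this host falls below the advertised rollout percentage."""
    return naive_bucket(machine_id) < percentage
```

The weakness is visible right in the signature: the bucket depends on
the machine ID alone, so a host with a low bucket is an early adopter
for every single release.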

This would then mean for the server that it would first serve
foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of
the hosts would update to it. And then, once you collected enough
feedback you'd rename the file to foobar_47.11_25.raw and 25% of the
hosts would switch over. Finally you'd set the value to 100 (or maybe
just drop it, which should be considered equivalent to 100), and then
all remaining hosts would update.

The effect of this is that clients could still explicitly upgrade if
they want, and the updates would be entirely driven by the clients,
but simply via naming the download images the server can control that
"by default" only the chosen number of clients update.

> Currently it seems like I would have to implement a different service
> which calls the sysupdate binary (or uses dbus once #28134 has landed)
> and then decides based on some other information.
>
> One idea I had would be that systemd-pull could send the machine-id
> based on which the server could then decide to provide the newer file
> (e.g. last two chars == "00" would roll it out to ~1/255). Though I am
> not sure if sd-pull is supposed to be "anonymous", i.e. do not provide
> this identifying information. Another drawback of this would be that
> stateless systems which reboot often get a new machine-id each boot,
> thus having an increased chance to get the newer version.

So this idea is not entirely different from my idea, I was just
thinking about pushing this into sysupdate rather than pull.

> Does anything like this already exist or is planned? Or should that be
> done by different applications on the client side?

I think it makes a ton of sense to add this to sysupdate. Would love
to review/merge a patch for that.

> I also remember there being a discussion about plugging in different
> sd-pull like implementations/backends[1] to support delta updates,
> other transports, or TLS client authentication. This could at least be
> adapted to support my idea to send the machine-id as an HTTP header
> (e.g. X-MACHINE-ID).

If we can avoid it, I'd always adopt a logic where identifying info
doesn't have to be sent to the server. After all the logic should be
generic and applicable in scenarios where the client should get
anonymity as much as it wants.

The machine-id we usually consider a "half-secret", i.e. all local
programs get access to it (unless sandboxed), but they are not
supposed to send it across the wire. If they really need to send
some identifier across the wire they should derive an app-specific ID
instead, which we make easy to acquire via
sd_id128_get_machine_app_specific().
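
To illustrate the principle, here is a Python sketch of a keyed-hash
derivation in the spirit of that call: HMAC-SHA256 keyed by the
machine ID over the application ID, truncated to 128 bits and marked
as a v4 UUID. Whether this matches the library's output bit for bit is
an assumption I have not verified, so use the real API in actual code.

```python
import hashlib
import hmac


def app_specific_id(machine_id_hex, app_id_hex):
    """Derive a 128-bit ID from the machine ID and an application ID.

    The machine ID acts as the HMAC key, so the result cannot be turned
    back into the machine ID, and two applications get unrelated IDs.
    """
    key = bytes.fromhex(machine_id_hex)
    msg = bytes.fromhex(app_id_hex)
    digest = bytearray(hmac.new(key, msg, hashlib.sha256).digest()[:16])
    # Set the UUID version (4) and variant bits so the result is a valid UUID.
    digest[6] = (digest[6] & 0x0F) | 0x40
    digest[8] = (digest[8] & 0x3F) | 0x80
    return bytes(digest).hex()
```

The application ID is just a fixed random 128-bit value that each
program picks once for itself.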

But better than app-specific machine IDs are no machine IDs at all in
the protocol, if we can get away with it. Hence, my idea of doing the
rollout percentage logic client-side.

Lennart

--
Lennart Poettering, Berlin


Re: Query on dynamic update of Kernel commandline

2023-12-21 Thread Lennart Poettering
On Mi, 13.12.23 10:28, HARSHAL PATIL (QUIC) (quic_hgpa...@quicinc.com) wrote:

> Hello Fellow Community,
>
> I have a following question. Your help will be really appreciated.
>
> "Kernel expects few params from kernel cmdline from underlying
> firmware. Is there any mechanism to dynamically update the cmdline
> updated by firmware (UEFI) during boot time in systemd-boot similar
> to DT fixup protocol ?"

I don't understand the question. Why would stuff from the UEFI
firmware be added to the kernel cmdline? and what does that have to do
with the DT fixup protocol?

Various UEFI interfaces are available from userspace anyway, are you
sure that whatever data you are looking for isn't readily available
from /sys/ anyway? why must it be kernel command line?

> Example : androidboot.serialno is read from firmware and needs to be appended 
> to kernel commandline

I don't know what this is, and what that has to do with uefi, sd-boot
or dt?

Anyway, the question is very confusing, I am not surprised no one
answered so far.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-16 Thread Lennart Poettering
On Do, 14.12.23 02:17, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering
>  wrote:
> >
> > On Di, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote:
> >
> > > > sysexts are erofs or squashfs file systems with verity backing. Only
> > > > the sectors you access are decompressed.
> > >
> > > Okay I forgot that they were erofs based and mentioned cpio archives
> > > so I assumed they would be one.
> > > Do they need to be fully read from disk to generate the cpio archive?
> >
> > erofs is a file system, cpio is a serialized archive. Two different
> > things. The discussion here is whether to pass the initrd to the
> > kernel as one or the other. But noone is suggesting to convert one to
> > the other at boot time.
>
> I was referring to the following line from sd-stub's man page: "The
> following resources are passed as initrd cpio archives to the booted
> kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd
> containing the sysexts has to be created at some point?

These cpios are created on-the-fly and placed into memory and passed
to the invoked kernel. And yes, for that the data they contain needs
to be read off disk first.

Lennart

--
Lennart Poettering, Berlin


Re: Ton of random units "could not be found"

2023-12-16 Thread Lennart Poettering
On Fr, 15.12.23 22:17, chandler (s...@riseup.net) wrote:

> Hi all,
>
>     When I run `systemctl status --all` I see a ton of "Unit X could not
> be found" where X = an item from the list below.  How did this mess
> happen and how to clean it up?  None of these units represent things the
> system is using, for the most part.

This is not an issue. As Andrei already answered this just tells you
that some services have ordering deps against other units which aren't
installed, which is entirely fine. It's just metainfo that if you
install some packages in combination the right order is applied.

There's a reason why these entries are generally not shown. Except you
used "--all", which literally means "Hey, please also show me
*everything* you have heard about". Just drop the "--all" from your
command line.

>     Some units appear to be remnants left behind in /etc/systemd, for
> example /etc/systemd/system/ntp.service is a symlink pointing to
> non-existent /lib/systemd/system/ntpsec.service.  I can delete
> /etc/systemd/system/ntp.service and after `systemctl daemon-reload` it's
> now gone from the list below.

That smells like a packaging bug, you removed some package and it
forgot to invoke "systemctl disable" from its packaging uninstall
scripts first. File a bug against your distro.

>     Other items have different situations, like tmp.mount exists at
> /usr/share/systemd/tmp.mount but isn't an enabled unit or anything, if I
> try to enable or unmask it I'm just told "Unit tmp.mount could not be
> found." or "Unit file tmp.mount does not exist."

/usr/share/systemd/ is not a directory systemd ever looks into for
unit files. If debian packaged something there, this smells like a
bug. Please report to your distro.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-13 Thread Lennart Poettering
On Di, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> > sysexts are erofs or squashfs file systems with verity backing. Only
> > the sectors you access are decompressed.
>
> Okay I forgot that they were erofs based and mentioned cpio archives
> so I assumed they would be one.
> Do they need to be fully read from disk to generate the cpio archive?

erofs is a file system, cpio is a serialized archive. Two different
things. The discussion here is whether to pass the initrd to the
kernel as one or the other. But no one is suggesting to convert one to
the other at boot time.

Lennart

--
Lennart Poettering, Berlin


Re: networkd RetransmitSec - how to make it work on a host?

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 02:49, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote:

> The RetransmitSec option was introduced in systemd-v255, but I
> cannot get it to work for Neighbor Solicitations from a
> Host. Instead, I observe that the NS are always transmitted at 1
> second intervals, regardless of whether it was changed by:

Please file this as git issue. It sounds like a bug report, which
should really go to github.

Lennart

--
Lennart Poettering, Berlin


Re: systemd units disabled when calling systemctl daemon-reload

2023-12-12 Thread Lennart Poettering
On Di, 12.12.23 19:06, Etienne Cordonnier (ecordonn...@snap.com) wrote:

> Hello,
> I am debugging some embedded system running systemd. The behavior I am
> observing is that many systemd targets such as multi-user.target are
> disabled after I run systemctl daemon-reload (as shown by systemctl
> list-units --type target --all). This causes many systemd units to be
> disabled, and forces me to reboot the system.

What do you mean by "disabled"?

In systemd targets can be active and inactive, and that's what
"systemctl list-units" shows. They can also be enabled/disabled, but
that's what "systemctl list-unit-files" shows. But targets such as
multi-user.target cannot be enabled nor disabled, they are considered
"static", i.e. always enabled if you so will. Which "systemctl
list-unit-files" should actually show.

Hence, I don't really grok what you are trying to say here...

> Is there a way to debug this systemd target transition? I already
> enabled systemctl
> log-level debug, but I still don't understand why the systemd target is
> changing when I call systemctl daemon-reload on this particular system.

Please state OS, systemd version and provide relevant logs. Otherwise
this is not actionable.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Di, 12.12.23 21:34, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem
> suggestion?

If you have 7 cpio initrds then the kernel will allocate a tmpfs and
unpack them all into it, one after the other, on top of each other,
and then jump into the result.

If you have an erofs and 7 cpio initrds, what are you going to do? You
cannot extract into an erofs, it's immutable. You'd need something
like overlayfs, but that would require (at least for now) an
additional step in userspace, which is something to avoid.

Alternatively (and preferred by me) the kernel would support a mode
where it would unpack any cpios it gets into a tmpfs, and then pass an
fsopen() fd to that to the executable it then invokes from the
erofs. The executable could then mount that somewhere if it wants. But
this would require a kernel patch.

> Even if everything is the same there are codes paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.

systemd-bsod is tiny though, less than 8K compressed here. Not sure it
is a good example.

> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.

sysexts are erofs or squashfs file systems with verity backing. Only
the sectors you access are decompressed.

Lennart

--
Lennart Poettering, Berlin


Re: IPv6 Compliance for networkd

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 19:14, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote:

> Hello, networkd developer community,
>
> I am hoping to rally support for making networkd IPv6 compliant and
> I'm will to help, but cannot do it alone. Is there any interest in
> making systemd-networkd IPv6 compliant?

Well, interest is relative. I think for most people IPv6 already works
well enough, they don't really care about compliance programs on this
so much. But as long as the requirements are reasonable they also
wouldn't mind (and prefer) if networkd passes those qualifications.

> There are many organizations (especially US Government) that mandate
> IPv6 compliance (USGv6).  Products that are dependent on networkd
> cannot be bid to these customers.

For the people currently involved with networkd upstream this is not a
top priority. If this is important to you however, that's great, we
are happy to review/merge patches.

> How do I engage with the right people in the developer community?

Send PRs via github.

> Thanks,
> Matt.
>
> PS: Mailing list topics go unanswered and github issues get lost in
> the noise, so I'm hoping there's a more efficient way to
> collaborate.

It's an Open Source project: if something matters a lot to you, then
please file PRs to get the work merged. We generally try to review PRs
sooner or later, but we are swamped with work, so it might take a
while. Just filing issues (while also appreciated) will usually
not magically make somebody work on this for you though. It's kinda
the same with most open source projects btw.

If this is something you'd like to see addressed soon, I'd recommend
maybe paying some consultancy (we have worked with codethink on some
projects, they should be willing to work on this, are capable and know
how to get stuff in systemd done).

If you don't have the cash for that, it might work to get funding for
this from organizations such as the German STF and things like
that. I am pretty sure that the US has something similar?

Anyway, judging by your email address I understand you work for HPE,
so I'd assume your company actually has the funds to payroll this
though, if this matters to you.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 17:03, Eric Curtin (ecur...@redhat.com) wrote:

> A generic approach is hard, I think it's worth discussing which type of boots
> you should actually care about milliseconds of performance for. It would be nice
> if we had an init system that contained the binary data to do the minimum for
> standard Fedora, Debian installs and everything else was an extension whether
> that's sysexts, dlopen, a new binary to execute etc.
>
> If the network is ingrained in your boot stack like this, I'm
> guessing you probably don't care about boot performance.

Uh, I am not sure that's really true. People boot up VMs on demand,
based on network traffic. They sure care about latency and boot
times. I mean people care about firecracker and these things precisely
because it brings the off-to-IP time to a minimum.

> Automotive has an expectation for really fast boots, like 2 seconds, in standard
> desktops installs there's some expectation as you interface directly
> with a human,
> but for other installs how much expectation is there?

AFAIR in particular in cars there's quite some functionality you
probably want to move very early in boot. Which yells to me that you
want a service manager super early. Which again suggests to me that
the first initrd that runs should probably already cover that.

If I were you I'd probably focus on a design like this: ship a basic
systemd in an initrd. Complete enough to find the harddisk, and to run
the other services that are absolutely necessary this early. Then,
once you found the disk, look for sysext images on it, and apply them
all on top of the initrd's root fs you are already running with. Never
transition anywhere else.

Then try to optimize the initrd a bit by making it an erofs/memmap
thing and so on. And make sure the initrd only contains stuff you
always need, so that reading it all into memory is necessary anyway,
and hence any approach that tries to run even the initrd off a disk
image won't be necessary because you need to read everything anyway.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 11:28, Demi Marie Obenour (d...@invisiblethingslab.com) wrote:

> I don't think this is "a pretty specific solution to one set of devices"
> _at all_.  To the contrary, it is _exactly_ what I want to see desktop
> systems moving to in the future.
>
> It solves the problem of large firmware images.  It solves the problem
> of device-specific configuration, because one can use a file on the EFI
> system partition that is read by userspace and either treated as
> untrusted or TPM-signed.  It means that one have a complete set of
> recovery tools in the event of a problem, rather than being limited to
> whatever one can squeese into an initramfs.  One can even include a full
> GUI stack (with accessibility support!), rather than just Plymouth.  For
> Qubes OS, one can include enough of the Xen and Qubes toolstack to even
> launch virtual machines, allowing the use of USB devices and networking
> for recovery purposes.  It even means that one can use a FIDO2 token to
> unlock the hard drive without a USB stack on the host.  And because the
> initramfs _only_ needs to load the boot extension volume, it can be
> very, _very_ small, which works great with using Linux as a coreboot
> payload.

systemd's "system extension" concept ("sysexts") already allows you to
do all that. The stuff I was fantasizing about would only change one
thing: instead of sd-stub from uefi mode already putting the sysexts
you installed into memory for the initrd to consume, it would be some
proto-initrd that would do so. This does not really change what you
can do with this, but mostly is just an optimization, reducing iops
and memory use a bit, and thus boot time latency.

> The only problem I can see that this does not solve is network boot, but
> that is very much a niche use case when compared to the millions of
> Fedora or Debian desktop installs, or even the tens of thousands of
> Qubes OS installs.  Furthermore, I would _much_ rather network boot be
> handled by userspace and kexec, rather than the closed source UEFI network
> stack.

Well, somebody's niche is somebody else's common case. In VM/cloud/server
scenarios network booting is not that "niche" as it might be on the desktop.

> It does require some care when upgrading, as the dm-verity image and the
> UKI cannot both be updated atomically, but one can solve that by first
> writing the new dm-verity image to a separate location.  The UKI will
> try both both the old and new locations for the dm-verity image and
> rename the new image over the old one on success.  The wrong image will
> simply fail to mount as its root hash will be wrong.

systemd-sysext already covers this just fine: you can encode in their
"extension-release" file to which base images they match up, and
systemd-sysext will then find the right one to apply, and ignore the
others. Thus just make sure you drop in the sysexts first, and the UKI
last and things should be perfectly robust.
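
The matching described above can be sketched roughly as follows. This
is a simplified illustration, not systemd's actual code: it assumes the
extension matches if its ID= equals the host's (or is "_any"), and then
compares SYSEXT_LEVEL= if both sides set it, else VERSION_ID=. The
helper names get_field and sysext_matches are made up for this example.

```shell
# Print the value of key $2 from the os-release style file $1,
# with surrounding quotes removed.
get_field() {
    sed -n "s/^$2=//p" "$1" | head -n 1 | tr -d "\"'"
}

# Return success if extension-release file $2 matches os-release file $1.
sysext_matches() {
    os_release="$1"; ext_release="$2"
    ext_id=$(get_field "$ext_release" ID)
    [ "$ext_id" = "_any" ] && return 0
    [ "$ext_id" = "$(get_field "$os_release" ID)" ] || return 1
    host_level=$(get_field "$os_release" SYSEXT_LEVEL)
    ext_level=$(get_field "$ext_release" SYSEXT_LEVEL)
    if [ -n "$host_level" ] && [ -n "$ext_level" ]; then
        [ "$host_level" = "$ext_level" ]
    else
        [ "$(get_field "$ext_release" VERSION_ID)" = "$(get_field "$os_release" VERSION_ID)" ]
    fi
}
```

With several sysexts for different base image versions installed side
by side, only those for which such a check succeeds would be applied.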

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:

> Although the nice thing about a storage-init like approach is there's
> basically zero copies up front. What storage-init is trying to be, is
> a tool to just call systemd storage things, without also inheriting
> all the systemd stack.

Just to make this clear: using things like systemd-cryptsetup outside
of the systemd stack is not going to work once you leave trivial
setups. i.e. the TPM hookup involves multiple services these days, and
it's not going to get any simpler. i.e. systemd-tpm2-setup,
systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing
reasonable disk encryption with TPM involved means you either buy into
the whole systemd offer (i.e. with the service manager) or you have to
rewrite your own systemd.

But maybe I am misunderstanding what you are saying here.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-12 Thread Lennart Poettering
On Mo, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote:

> Sort of yes, but preferably using that __initramfs_start /
> initrd_start buffer as is without copying any bytes anywhere else and
> without teaching the bootloaders to do things.
>
> The "memmap=" approach you suggested sounds like what we are thinking,
> but do you think we could do this without teaching bootloaders to do
> new things?

Well, in a standard UEFI world it would suffice to teach the memmap=
logic to the stub that is glued in front of the kernel. For example,
make sd-stub find the erofs initrd in the UKI, then trivially
synthesize a memmap= switch and append it to the kernel command line.
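
As a rough illustration of what such a synthesized switch would look
like: the kernel's memmap=nn!ss form takes a size and a start address.
The values 64M and 4G below are hypothetical, and pmem_cmdline is a
made-up helper name; a real stub would fill in wherever the erofs
image actually ended up in memory.

```shell
# Build the kernel command line fragment that marks a physical memory
# range as persistent memory and mounts the resulting block device as root.
# $1 = size of the region, $2 = start address (hypothetical example values).
pmem_cmdline() {
    printf 'memmap=%s!%s root=/dev/pmem0\n' "$1" "$2"
}

pmem_cmdline 64M 4G
```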

but of course, you don't believe in UEFI or good boot loaders, so you
kinda dug your own grave here...

(The main reason why sd-stub doesn't actually support erofs-initrds,
is that sd-stub also generates initrd cpios on the fly, to pass
credentials and system extension images to the kernel, and you can't
really mix erofs and cpio initrds into one)

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 11:42, Eric Curtin (ecur...@redhat.com) wrote:

> I am also thinking, what is the difference between "make the
> bootloader load the erofs into contiguous memory" part and doing
> something like storage-init.

Well, from my PoV there's value in reducing the stages of the boot
process, and reducing the amount of storage stacks you need in the
mix. Hence, the boot loader can load stuff from disk into memory
anyway, it always has done that, typically the kernel and the
initrd. Just swapping out the format of the initrd to get better
behaviour is relatively cheap there, means no additional storage
logic, no additional stage of the boot. You basically only have "boot
loader" (which loads kernel and initrd), and the "host os" (which runs
off the final rootfs).

Otoh if you let your storage-init load the initrd, then you basically
have a third step in the middle, which shares a lot of props with the
last step, but also is distinct. I mean, you probably would reinvent
your own udev and DM stack for that, to get verity in the mix (because
that depends on DM, and udev, to some degree)

In my ideal model, initrds are just part of the UKI btw, so they end
up being loaded together with the rest of the kernel, and need no
verity because they are signed along with the UKI itself.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote:

> > > For the items listed above I think you can find different solutions
> > > which do not necessarily compromise security as much.
> > >
> > > So, in the list above you could address the latter three like this:
> > >
> > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot
> > >loader load the erofs into contigous memory, then use memmap=X!Y on
> > >the kernel cmdline to synthesize a block device from that, which
> > >you then mount directly (without any initrd) via
> > >root=/dev/pmem0. This means yout boot loader will still load the
> > >whole image into memory, but only decompress the bits actually
> > >neeed. (It also has some other nice benefits I like, such as an
> > >immutable rootfs, which tmpfs-based initrds don't have.)
>
> What I am unsure about here, is the "make the bootloader load the
> erofs into contiguous memory" part. I wonder could we try and use the
> existing initramfs data as is.

Today's initrds are packed cpio archives of an OS file system
hierarchy. What I proposed means you'd have to put the OS file system
hierarchy into an erofs image instead. Which is a trivial operation,
just unpack and repack.

Note that there are two concepts of "initrd" out there.

a) from the kernel perspective an initrd/initramfs (which both are
   badly named, because it's a tmpfs these days) is that packed cpio
   archive that is unpacked into a tmpfs, and then jumped into.

b) from systemd's perspective an initrd is an OS image that carries an
   /etc/initrd-release file. If that file exists then systemd will not
   boot up the system regularly, but instead just prepare everything
   that it can transition into some other root fs.

Most often in real life the initrds currently qualify under both
definitions. But there's no reason to always do this. You can also
have images the kernel would consider an initrd, but systemd does not,
which is something we use in the "USI" concept, i.e. "unified system
images", which are basically UKIs (large UKIs) with a complete rootfs
that is the main system of the OS. And you can also do it the other
way round, which is potentially what I am suggesting to you here: use
an erofs image that would not be considered an initrd by the kernel,
but that systemd would consider one, and transition out of.
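
Definition b) boils down to a file existence check, which can be
sketched like this. The function name is_systemd_initrd is made up
for this example; the argument is the root of the OS tree to inspect.

```shell
# Return success if the OS tree rooted at $1 carries /etc/initrd-release,
# i.e. would be treated as an initrd under definition b) above.
is_systemd_initrd() {
    [ -e "${1%/}/etc/initrd-release" ]
}

if is_systemd_initrd /; then
    echo "running from something systemd considers an initrd"
else
    echo "running from a regular root fs"
fi
```

Note that this says nothing about definition a): whether the tree
arrived as a cpio archive, an erofs image or anything else is invisible
to this check.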

> I dunno if
> bootloaders make much assumptions about the format of that data, worst
> case scenario we could encapsulate erofs in the initramfs, cpio looking
> data.

boot loaders generally don't bother with the cpio, it's just "data"
for them. Compression algorithms have changed in the past, and it only
mattered that the kernel could decompress it, the boot loader doesn't care.

> Teach the kernel not to decompress and process the whole
> thing and mount it like an erofs alternatively. Does this sound crazy
> or reasonable?

You are re-inventing the traditional "initrd" logic of the kernel
which was a ramdisk (i.e. a block device /dev/ram0), that was filled
with some fs of your choice loaded by the boot loader.

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
On Mo, 11.12.23 10:57, Lennart Poettering (mzerq...@0pointer.de) wrote:

> Which leaves item 1, which is a bit harder to address. We have been
> discussing this off an on internally too. A generic solution to this
> is hard. My current thinking for this could be something like this,
> covering the UEFI world: support sticking a DDI for the main initrd in
> the ESP. The ESP is per definition unencrypted and unauthenticated,
> but otherwise relatively well defined, i.e. known to be vfat and
> discoverable via UUID on a GPT disk. So: build a minimal
> single-process initrd into the kernel (i.e. UKI) that has exactly the
> storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs
> drivers, and dm-verity. Then have a PID 1 that does exactly enough to
> jump into the rootfs stored in the ESP. That latter then has proper
> file system drivers, storage drivers, crypto stack, and can unlock the
> real root. This would still be a pretty specific solution to one set
> of devices though, as it could not cover network boots (i.e. where
> there is just no ESP to boot from), but I think this could be kept
> relatively close, as the logic in that case could just fall back into
> loading the DDI that normally would still in the ESP fully into
> memory.

BTW, one thing I would like to emphasize though. I think this item is
really the last thing you should focus on. If your OS never
transitions out of the initrd, and gets its payload merged in via
DDIs, then the root fs should be reasonably small enough and "fully
used at boot" (i.e. every sector read anyway) that doing this extra
work of finding a split-out DDI on the ESP is entirely unnecessary and
just a waste of time (both of developer time and boot time).

Lennart

--
Lennart Poettering, Berlin


Re: [RFC] initoverlayfs - a scalable initial filesystem

2023-12-11 Thread Lennart Poettering
a PID 1 that does exactly enough to
jump into the rootfs stored in the ESP. That latter then has proper
file system drivers, storage drivers, crypto stack, and can unlock the
real root. This would still be a pretty specific solution to one set
of devices though, as it could not cover network boots (i.e. where
there is just no ESP to boot from), but I think this could be kept
relatively close, as the logic in that case could just fall back into
loading the DDI that normally would sit in the ESP fully into
memory.

(If you are focussing on systems lacking UEFI, then replace the word
"ESP" in the above with a similar concept, i.e. a well discoverable,
unauthenticated relatively simple file system, such as vfat).

Anyway, I can't tell you how to solve your specific problems, but if
there's one thing I'd suggest you to keep in mind then it's the
security angle, i.e. keep in mind from the beginning how
authentication of every component of your process shall work, how
unattended disk encryption shall operate and how measurement shall
work. Security must be built into things from the beginning, not be
added as an afterthought.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Manual start of user@.service failed with permission denied

2023-12-08 Thread Lennart Poettering
On Fr, 08.12.23 08:52, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi Lennart,
>
> I know we are not using the pam_systemd. That is the reason we try
> to run the steps manually. It was possible to start the
> user@.service in systemd v253, but it fails now with v254 or
> later.

Well, that's not supported then. You need XDG_RUNTIME_DIR set up
properly, and that's what the PAM module gives you. If you turn off
the PAM module then you get to keep the pieces, you voided your
warranty.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Manual start of user@.service failed with permission denied

2023-12-07 Thread Lennart Poettering
On Do, 07.12.23 18:29, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi Lennart,
>
> We are doing the steps to start up a rootless docker. If I don’t set
> XDG_RUNTIME_DIR then I will get the below error:
>
> systemd[1925]: Trying to run as user instance, but $XDG_RUNTIME_DIR
> is not set.

pam_systemd is responsible for setting this env var. Most likely you
are missing that from the PAM stack that is used by this user@.service
instance?

> The 503 is a system user. So, just to try it out, I created a user,
> which got the UID 1001. Using that UID gave me the same result as
> the 503.

It's a bad idea to run user stuff as system user.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Manual start of user@.service failed with permission denied

2023-12-06 Thread Lennart Poettering
On Mi, 06.12.23 14:46, Christopher Wong (christopher.w...@axis.com) wrote:

> Hi,
>
> I’m trying to do the following:
>
> root@host:~# systemctl set-environment
> XDG_RUNTIME_DIR="/run/user/503"

Why would you do that?

user@.service automatically pulls in user-runtime-dir@.service which
is responsible for creating that dir with right perms.

Is 503 a system user? Or a regular user?

systemd generally assumes the boundary between system and regular
users is between 999 and 1000.

But user@.service is really just for regular users, not system users,
hence my question.
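
The boundary mentioned above can be sketched as a simple comparison.
This assumes the common default of 999 as the highest system UID
(distributions can configure this differently) and ignores special
ranges; SYSTEM_UID_MAX and is_system_uid are names made up for this
example.

```shell
# Highest UID still treated as a system user (assumed default).
SYSTEM_UID_MAX=999

# Return success if UID $1 falls into the system user range.
is_system_uid() {
    [ "$1" -le "$SYSTEM_UID_MAX" ]
}

for uid in 503 1001; do
    if is_system_uid "$uid"; then
        echo "$uid: system user"
    else
        echo "$uid: regular user"
    fi
done
```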

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to debug systemd-pcrphase-initrd.service failure

2023-12-06 Thread Lennart Poettering
On Mi, 06.12.23 18:28, Renjaya Raga Zenta (ragaze...@gmail.com) wrote:

> Hi,
>
> I am exploring OS image building with mkosi. It works great until I add TPM
> 2.0 in qemu.
>
> I found that the systemd-pcrphase-initrd.service failed. There are 3
> pcrphase service:
>
> 1. systemd-pcrphase-initrd.service (failed)
> 2. systemd-pcrphase.service (ok)
> 3. systemd-pcrphase-sysinit.service (ok)

So the latter two run from the host fs, the first one from the initrd fs.

> Related journal log:
> systemd[1]: Failed to start systemd-pcrphase-initrd.service - TPM2 PCR
> Barrier (initrd).
> ...
> systemd-pcrphase[130]: Failed to load TPM2 libraries: Operation not
> supported
> ...

It appears you are lacking the tpm2-tss libraries in your initrd image.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] how to use systemd-sysext addons and systemd-stub to extend an UKI initrd

2023-12-05 Thread Lennart Poettering
On Mo, 04.12.23 17:40, Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote:

> Hello everyone,
>
> As the title suggests, I am trying to extend an UKI initrd via
> systemd-sysext addons/extensions.
>
> I contributed to the systemd-stub UKI addons to extend the kernel
> command line, so I know how they works and planning to give a talk about
> them soon. However, I would like to get the full picture by using the
> same mechanism but with systemd-sysext addons to extend also initrd.
>
> As I understood, a systemd-sysext addon in
> /boot/efi/EFI/Linux/.efi.extra.d will be put in /.extra/sysext
> by systemd-stub, and then will be picked up by systemd-sysext to be
> added into the initrd.
>
> I am using Fedora, I created my UKI devel.efi, and made sure (just for
> safety) that the initrd contains the systemd-sysext module, as I
> generated it with dracut.
>
> The UKI is created with freshly compiled systemd-stub from commit
> 5808300c44. Kernel is 6.6.0-0.rc1.20230915git9fdfb15a3dbf.17.fc40.x86_64
>
> Then, I created a super dumb extension and put it in the right location:
> mkdir extension
> cd extension/
> vi ciao.txt
> mkdir usr
> cp ciao.txt usr/ciao2.txt
> cat /etc/os-release
> mkdir -p usr/lib/extension-release.d/
> echo ID=fedora > usr/lib/extension-release.d/extension-release.extension
> echo VERSION_ID=40 >>
> usr/lib/extension-release.d/extension-release.extension
> cat usr/lib/extension-release.d/extension-release.extension
> cd ..
> mksquashfs extension extension.raw
> mv extension.raw /boot/efi/EFI/Linux/devel.efi.extra.d/

The image must come with verity + signature, we'll not allow unsigned
extensions by default. (you could relax the image policy if you want,
or disable it but I'd advise you not to. The env var
SYSTEMD_DISSECT_VERITY_SIGNATURE=0 tells sysext to not validate images)

With upcoming systemd v255 just use "systemd-repart --make-ddi=sysext"
to generate a sysext image with verity and signing. mkosi can help you
too.

You either need to install your signature public key in the kernel's
own keychain somehow, or drop suitable certficates into
{/etc,/run}/verity.d/*.crt. The latter is a bit
underdocumented. (There was hope we could drop this again because it
would become easier to install stuff into the kernel keychain, but
that's still a mess, hence this userspace validation is probably going
to stay for good).

Ultimately if distros ship this in final products they really should
use the kernel keyring for this. That's how MSFT uses this for
example.

> Supposing I manage to do all of the above, my next question would be
> how/if to override the /lib folder instead of the traditional /usr or
> /opt, as for example I might want to add another kernel module into
> the UKI.

/lib/ is 1990's Linux. On modern distros, such as Fedora it has long
been replaced by a symlink to /usr/lib/. Hence if you want to drop
stuff into /lib/ then just drop it into /usr/lib/ instead.

> Last but not least is where is the documentation for this. I couldn't
> find anything at all about systemd-sysext, and therefore I would be very
> very happy to write (other than presenting it) some doc to make the life
> easier to anyone like me that is looking forward to using these new
> features.

So there's the man page of systemd-sysext and systemd-repart.

Flatcar has some docs:
https://www.flatcar.org/docs/latest/provisioning/sysext/

There is a video from ASG how this fits together:

https://www.youtube.com/watch?v=XTy3scX6rF4

There's no tutorial how to put this together though. Contributing that
would be very welcome of course!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Where to install UKI cmdline addons in the root partition

2023-12-05 Thread Lennart Poettering
On Mo, 04.12.23 17:48, Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote:

> Hello everyone,
>
> Sorry for the back-to-back emails, but I realized I could use this
> mailing list to bring up another topic related to UKI addons.
>
> This is the same as I wrote in
> https://github.com/systemd/systemd/issues/29372 : I think we need some
> agreement to decide that if distros want to ship rpms containing default
> signed UKI addons, they should all go in the same place in the root
> partition.
>
> By putting them there, we offer the user the possibility to keep the ESP
> clean and lightweight (as there is not much space available in there
> IIRC), and the user can simply cp the addons from the root partition
> into the desired ESP to install the addon, and rm to remove them.
>
> But I still think it is important to have some agreement, and document
> it somewhere.
>
> What do you think?

I commented on the github issue. At this time I think more people are
subscribed to that than watch this ML.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Configure netdev RPS using systemd-networkd

2023-12-04 Thread Lennart Poettering
On Mo, 04.12.23 14:59, Renjaya Raga Zenta (ragaze...@gmail.com) wrote:

> Hi,
>
> We want to implement our networking using systemd-networkd. We think
> systemd is stable enough right now, so we want to try more "systemd-only"
> solution.
>
> In our environment, we use RPS (Receive Packet Steering) for load balancing
> and scaling. It's a kernel feature implemented a long time ago. You could
> visit the documentation at
> https://www.kernel.org/doc/html/latest/networking/scaling.html.
>
> Currently, we manually do this after network interface is configured:
>
> echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
>
> where f is bitmap mask , it means to utilize 4 cpus.
>
> Will this use case be implemented in systemd-networkd? Or should we use a
> third party solution such as networkd-broker or networkd-dispatcher?

I see no reason why we wouldn't add a high-level option for this to
.link files.

We are happy to review/merge a patch. Please submit via GitHub.
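
For reference, the manual step quoted above can be sketched as
follows. The value written to rps_cpus is a hexadecimal bitmap with
one bit per CPU; rps_mask is a made-up helper that selects CPUs 0..N-1,
and the sysfs path is the one from the quoted mail.

```shell
# Print the hexadecimal bitmap that selects CPUs 0..N-1.
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

rps_mask 4    # prints: f

# Applying it needs root and an existing interface, hence commented out:
# rps_mask 4 > /sys/class/net/eth0/queues/rx-0/rps_cpus
```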

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd: questions about dbus dependency service

2023-12-04 Thread Lennart Poettering
On Mo, 04.12.23 13:01, Pintu Agarwal (pintu.p...@gmail.com) wrote:

> Hi,
> Any comments or suggestions on the below ?

I already replied.

https://lists.freedesktop.org/archives/systemd-devel/2023-November/049706.html

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Systemd-nspawn single process

2023-12-01 Thread Lennart Poettering
On Fr, 01.12.23 14:03, Warex61 YTB (thomasdabou...@gmail.com) wrote:

> Hello,
> I would like to use systemd-nspawn to create a container that can launch a
> single process as pid 1 and mount its configuration files. I want the
> container to be as light as possible. Is there any way of creating a
> container using nspawn without using bootstrap ?
>
> For example, using this command, without using a bootstrap
>
> systemd-nspawn -M process -D /etc/systemd/nspawn/process
> /etc/systemd/nspawn/process.nspawn
> I get the following error
>
> Directory /etc/systemd/nspawn/process doesn't look like it has an OS tree.
> Refusing.
> What are the conditions for nspawn to consider an OS tree in
> /etc/systemd/nspawn/process ?

You are using an ancient version of nspawn. Since 2y or so the message
reads:

Directory %s doesn't look like it has an OS tree (/usr/ directory is missing). Refusing.

And that's your explanation: you need an /usr/ directory.
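
A minimal sketch of that condition, for checking a directory before
handing it to nspawn. It only mirrors the check named in the error
message above; looks_like_os_tree is a made-up name, and nspawn itself
may well look at more than this.

```shell
# Return success if directory $1 contains a /usr/ directory.
looks_like_os_tree() {
    [ -d "${1%/}/usr" ]
}

if looks_like_os_tree /etc/systemd/nspawn/process; then
    echo "has /usr/, nspawn should accept it as an OS tree"
else
    echo "no /usr/ directory, nspawn would refuse"
fi
```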

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to properly wait for udev?

2023-11-29 Thread Lennart Poettering
On Mo, 27.11.23 21:32, Richard Weinberger (richard.weinber...@gmail.com) wrote:

> On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering
>  wrote:
> > If they conceptually should be considered block device equivalents, we
> > might want to extend the udev logic to such UBI devices too.  Patches
> > welcome.
>
> Why doesn't udev flock() every device it is probing?
> Or asked differently, why is this feature opt-in instead of opt-out?

Some software really doesn't like it if we take BSD locks on their
devices, hence we don't take it blanket everywhere. And what's more
important even: for various devices it simply isn't safe to just
willy-nilly even open them (tape drives and things, which might start
to pull in a tape if we do). For others we might not be able to even
open them at all with the wrong flags (for example, because they are
output only).

Block devices have relatively well defined semantics, there it's
generally safe to do this, hence we do.

Hence, it might be safe for UBI, but for the general case it might
not be.

That said, would BSD locking even address your issue? If your devices
are exclusive access things and we first open() them and then flock()
them, then that's not atomic. So if your test cases open the devices,
then flock() them you might still get into conflict with udev because
it just open()ed the device, but didn't get to call flock() yet.

Doesn't UBI have something like O_EXCL-behaviour that grants true
exclusive access?
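
To make the open()-then-flock() sequence above concrete, here is a
small sketch using the flock(1) tool from util-linux. The function
names are made up for this example, and it works on any path (a
regular file is enough to try it out).

```shell
# Run a command while holding an exclusive BSD lock on path $1.
# flock(1) open()s the path first and only then calls flock(2); the two
# steps are not atomic, which is exactly the window described above.
with_exclusive_lock() {
    path="$1"; shift
    flock -x "$path" "$@"
}

# Try to take the lock without blocking; succeeds only if nobody
# else currently holds a conflicting lock on the same path.
lock_is_free() {
    flock -x -n "$1" true
}

# Example usage (path is a placeholder):
# with_exclusive_lock /dev/ubi0_0 some-exclusive-operation
```

A second opener can still open() the node while the first holds the
lock; the lock only helps if every party takes it before doing
anything else with the fd.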

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd: questions about dbus dependency service

2023-11-28 Thread Lennart Poettering
On Di, 28.11.23 22:48, Pintu Agarwal (pintu.p...@gmail.com) wrote:

> Hi,
>
> I need some clarification about systemd services that are dependent on dbus
> service.
>
> We have a service that depends on dbus.service, so our service has to be
> started after dbus.socket and dbus.service.

It's usually a good idea to not wait for dbus.service. Waiting for
dbus.socket is sufficient, it makes sure clients can connect to D-Bus
(even if dbus needs to finish starting up to respond to it). This will
increase parallelization during boot.
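
As a sketch of that ordering, the fragment below only depends on
dbus.socket. The helper name and the ExecStart= path are hypothetical
placeholders, not taken from the original question.

```shell
# Print a unit fragment that orders a service after dbus.socket only,
# instead of waiting for the bus service itself to finish starting.
# The ExecStart= path is a hypothetical placeholder.
print_dbus_socket_unit() {
    cat <<'EOF'
[Unit]
Description=Example service that talks to D-Bus
Requires=dbus.socket
After=dbus.socket

[Service]
ExecStart=/usr/bin/myearly
EOF
}

print_dbus_socket_unit
```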

> But dbus.service comes after local-fs.target and sysinit.target.
> However, our service needs to be started very early on boot-up, maybe
> within local-fs target itself, otherwise it is causing regression in our
> boot KPI.

dbus is not a suitable IPC for early boot services, unless you speak
the dbus protocol directly between client and service, without
involving the broker. But that's messy.

systemd's PID 1 does this (i.e. dbus without a broker), because it
must be accessible early on, but I hate that code, and I'd rather kill
it. In new code that must run in early boot we usually use a different
IPC (varlink), that does not involve any broker, and thus always works.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to properly wait for udev?

2023-11-27 Thread Lennart Poettering
On So, 26.11.23 00:39, Richard Weinberger (richard.weinber...@gmail.com) wrote:

> Hello!
>
> After upgrading my main test worker to a recent distribution, the UBI
> test suite [0] fails at various places with -EBUSY.
> The reason is that these tests create and remove UBI volumes rapidly.
> A typical test sequence is as follows:
> 1. creation of /dev/ubi0_0
> 2. some exclusive operation, such as atomic update or volume resize on
> /dev/ubi0_0
> 3. removal of /dev/ubi0_0
>
> Both steps 2 and 3 can fail with -EBUSY because the udev worker still
> holds a file descriptor to /dev/ubi0_0.

Hmm, I have no experience with UBI, but are you sure we open that? why
would we? are such devices analyzed by blkid? We generally don't open
device nodes unless we have a reason to, such as doing blkid on it or
so.

What precisely fails for you? the open()? or some operation on the
opened fd?

>
> FWIW, the problem can also get triggered using UBI's shell utilities
> if the system is fast enough, e.g.
> # ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0
> Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872
> bytes (15.5 KiB), dynamic, name "testv", alignment 1
> ubirmvol: error!: cannot UBI remove volume
>  error 16 (Device or resource busy)
>
> Instead of adding a retry loop around -EBUSY, I believe the best
> solution is to add code to wait for udev.
> For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1]
> seems like a good idea to me.

For block devices we implement this:

https://systemd.io/BLOCK_DEVICE_LOCKING

I understand UBI aren't block devices though?

If they conceptually should be considered block device equivalents, we
might want to extend the udev logic to such UBI devices too.  Patches
welcome.

We provide "udevadm lock" to lock a block device according to this
scheme from shell scripts.

> What function from libsystemd do you suggest for waiting until udev is
> done with rule processing?
> My naive approach, using udev_queue_is_empty() and
> sd_device_get_is_initialized(), does not resolve all failures so far.
> Firstly, udev_queue_is_empty() doesn't seem to be exported by
> libsystemd. I have open-coded it as:
> static int udev_queue_is_empty(void) {
>return access("/run/udev/queue", F_OK) < 0 ?
>(errno == ENOENT ? true : -errno) : false;
> }

This doesn't really work. udev might still process the device in the
background.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Does coredumpctl info support minidebuginfo / gnu_debugdata ?

2023-11-17 Thread Lennart Poettering
On Do, 16.11.23 18:37, Etienne Cordonnier (ecordonn...@snap.com) wrote:

> Hello,
> I am testing a yocto based system, where it seems that "coredumpctl info"
> isn't able to use minidebuginfo / gnu_debugdata to extract a symbolized
> call-stack. I saw in the code that coredumpctl uses elfutils / libdwfl in
> order to extract a call-stack, and as far as I understand libdwfl supports
> minidebuginfo since this commit (
> https://sourceware.org/git/?p=elfutils.git;a=commit;h=5083a70d3b64946fa47ea5766943a15a3ecc6891
> ).
>
> Is there a configuration / build-option / etc. to enable support for
> minidebuginfo in coredumpctl? If no is it on the roadmap? The advantage of
> minidebuginfo is that it is much smaller than full debug symbols.

Fedora has been using minidebuginfo since ~10y or so, and
coredumpctl/libdwfl has been working fine with it. So it certainly
works, it's how this all works on my local machine since forever.

Maybe ask your distro for help, it's generally an integration issue of
distributions if this doesn't work.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Low memory dbus signal for GMemoryMonitor

2023-11-13 Thread Lennart Poettering
On Di, 14.11.23 15:00, Kate Hsuan (h...@redhat.com) wrote:

> Hi Folks,

Hi!

> Could systemd detect the system's low memory status and send a signal
> through Dbus about low memory events?

We already have an interface for this, it's documented here:

https://systemd.io/MEMORY_PRESSURE

It doesn't operate via D-Bus however, but instead just tells apps how
to directly get the events from the kernel. That's generally better
than bumping the events off two daemons (i.e. a memory pressure daemon
and a dbus broker), simply because memory pressure is a problem of
latency, and you should not add additional steps to the notifications
if you want to make things better and not worse. Moreover, on memory
pressure you shouldn't allocate more memory, which is something the
indirection through a daemon and broker would typically mean.

> We are looking for a new backend for GMemoryMonitor.
> https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html
>
> The original backend- low-memory-monitor monitors the system memory
> usage. When it detects the memory is lower than a level, it signals
> the application. It also manages the kernel OOM.

It should be possible to implement a GMemoryMonitor on top of the
kernel APIs directly, using the information systemd gives you. See the
documentation. It even briefly mentions GMemoryMonitor at the end.
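
As a rough illustration of the kernel-side data involved, here is a
small Python sketch that parses the PSI format exposed in files such as
/proc/pressure/memory. The sample values are made up, and parse_psi()
is a hypothetical helper. A real monitor would rather register a
trigger and poll() the file as the documentation describes, so this
only shows what the data looks like.

```python
def parse_psi(text):
    """Parse the contents of a PSI file such as /proc/pressure/memory.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns a dict keyed by the first word ("some", "full").
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, values = fields[0], {}
        for item in fields[1:]:
            key, _, raw = item.partition("=")
            # "total" is a microsecond counter, the averages are percentages.
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Made-up sample in the kernel's format, so the sketch runs anywhere.
sample = (
    "some avg10=1.50 avg60=0.40 avg300=0.10 total=123456\n"
    "full avg10=0.25 avg60=0.00 avg300=0.00 total=7890\n"
)
pressure = parse_psi(sample)
```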

If you have any questions about details, feel free to ask!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)

2023-10-30 Thread Lennart Poettering
On Mo, 30.10.23 10:17, Lennart Poettering (lenn...@poettering.net) wrote:

> On Fr, 27.10.23 20:46, Tony Rodriguez (unixpro1...@gmail.com) wrote:
>
> > Andrea asked for more details so I have provide this verbose output.
> >
> > 1) Lennart's recommendation of removing "/tmp" within /etc/fstab and using
> > tmpfs for "/tmp" appears to stop the dependency issue for systemd-239 for
> > systemd-252.  However, RH8 and RH9 don't support systemd-networkd, I am
> > wondering how this can be overcome if removing "/tmp" and using "tmpfs"
> > aren't options?  Would I have to modify various services and targets? What
> > would I need to add or remove within services and targets to avoid these
> > dependencies?
>
> This is something you'd have to ask your OS vendor. If they don't
> support networkd, they will support something else, and maybe it has a
> way to run in early boot or initrd.
>
> Booting without /tmp/ mounted during early boot is simply not
> supported from upstream, sorry, can't help you there. Please contact
> your OS vendor if they can help you.
>
> > 2) On another note, with RH9 systemd-252-14/systemd-252-18 and iscsi, new
> > dependency issues occur if "_netdev" within /etc/fstab is specified for
> > "/var" or "/usr".
>
> If /usr/ is split off it *must* be mounted even earlier than /tmp/: it
> must be mounted in the initrd, nothing else is supported, sorry.
>
> If /var/ is split off it must be mounted at the same point as /tmp/,
> i.e. some time in early boot, not necessarily in the initrd though.

Since we never documented this properly I have put together another
piece of documentation that summarizes the requirements on mounts and
when they must be available during boot:

https://github.com/systemd/systemd/pull/29761

You can see the rendered version here (until the next PR push that is)

https://github.com/systemd/systemd/blob/87828aae4712bdb300101b05911392c43d081a6b/docs/MOUNT_REQUIREMENTS.md

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)

2023-10-30 Thread Lennart Poettering
On Fr, 27.10.23 20:46, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> Andrea asked for more details so I have provide this verbose output.
>
> 1) Lennart's recommendation of removing "/tmp" within /etc/fstab and using
> tmpfs for "/tmp" appears to stop the dependency issue for systemd-239 for
> systemd-252.  However, RH8 and RH9 don't support systemd-networkd, I am
> wondering how this can be overcome if removing "/tmp" and using "tmpfs"
> aren't options?  Would I have to modify various services and targets? What
> would I need to add or remove within services and targets to avoid these
> dependencies?

This is something you'd have to ask your OS vendor. If they don't
support networkd, they will support something else, and maybe it has a
way to run in early boot or initrd.

Booting without /tmp/ mounted during early boot is simply not
supported from upstream, sorry, can't help you there. Please contact
your OS vendor if they can help you.

> 2) On another note, with RH9 systemd-252-14/systemd-252-18 and iscsi, new
> dependency issues occur if "_netdev" within /etc/fstab is specified for
> "/var" or "/usr".

If /usr/ is split off it *must* be mounted even earlier than /tmp/: it
must be mounted in the initrd, nothing else is supported, sorry.

If /var/ is split off it must be mounted at the same point as /tmp/,
i.e. some time in early boot, not necessarily in the initrd though.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)

2023-10-27 Thread Lennart Poettering
On Do, 26.10.23 19:03, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> Experiencing this same issue with iSCSI and systemd-239 for RH8/Rocky8 and
> RH9/Rocky9 system-252. Nothing was done on my end to create this issue.  In
> other words, no custom mount/unit files or services, just your typical ISO
> install and rpm updates.
>
> An ordering cycle occurs, when "_netdev" is specified within /etc/fstab for
> systemd.  This happens with systemd-239-14 and systemd-239-18 using iSCSI
> based file systems.    Seems others are experiencing this as well (see link
> below).  I can also confirm this happens with systemd-252 (RH9/Rocky9).
> Especially if "_netdev" is used with either "/var" or "/usr" iSCSI based
> devices/file systems.  The system may not boot, may not mount file systems,
> may not start services/unit files, and the system becomes slow during system
> boot.
>
> Does anyone know of a fix/patch and root cause for this?
>
> Please see this link:
> https://issues.redhat.com/browse/RHEL-12987?jql=project%20%3D%20RHEL%20AND%20affectedVersion%20%3D%20rhel-9.2.0%20AND%20text%20~%20%22iscsi%22
>
> # cat /etc/fstab
> [...]
> /dev/mapper/rhel-root /   xfs defaults,_netdev 0 0
> UUID=2177a7fc-bc41-43e4-bdc1-d231a5eb4680 /boot xfs defaults,_netdev 0 0
> /dev/mapper/rhel-tmp /tmp    xfs defaults,_netdev 0 0
> /dev/mapper/rhel-var /var    xfs defaults,_netdev,x-initrd.mount 0 0
> /dev/mapper/rhel-var_log /var/log    xfs defaults,_netdev 0 0
> /dev/mapper/rhel-var_tmp /var/tmp    xfs defaults,_netdev 0 0
>
> # journalctl -b | grep deleted
> Oct 13 08:15:35 vm-isci8 systemd[1]: basic.target: Job tmp.mount/start
> deleted to break ordering cycle starting with basic.target/start
> Oct 13 08:15:35 vm-isci8 systemd[1]: network.target: Job
> network-pre.target/start deleted to break ordering cycle starting with
> network.target/start
> Oct 13 08:15:35 vm-isci8 systemd[1]: NetworkManager.service: Job
> dbus.socket/start deleted to break ordering cycle starting with
> NetworkManager.service/start

/tmp must be available during early boot already, and your
NetworkManager service is apparently a late boot service. Hence you
have a cycle: you want that /tmp/ is mounted after the network, but
your network is configured really late. But /tmp is necessary during
early boot. BOOM!

Two ways out:

1. Don't make /tmp an iscsi mount. Bad idea anyway. Just use tmpfs for
   it, like everyone else.

2. Upgrade to a better network management solution that has no
   problems with running in early boot, for example systemd-networkd.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to use systemd-growfs* services with GPT automount

2023-10-25 Thread Lennart Poettering
On Di, 24.10.23 23:48, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> > On Mo, 23.10.23 02:00, Nils Kattenbeck (nilskem...@gmail.com) wrote:
> >
> > > Hello,
> > >
> > > I am not sure how to get systemd-growfs-root.service to work with
> > > automount. The partitions are configured via systemd-repart (and the
> > > image created using mkosi). While the partitions are correctly grown
> > > upon boot, the contained filesystem is not grown to match the
> > > partition even though GrowFileSystem defaults to true. Is there
> > > anything I am missing or an easy way to troubleshoot this and get more
> > > information?
> > >
> > > One thing I notice is that the generator.late/-.mount unit has a
> > > Options=ro which as per documentation prevents growing the filesystem.
> > > However, the filesystem is actually mounted read-write so I assume
> > > this is just an artifact of the initrd. Is it not possible to grow the
> > > filesystem from which the initrd starts?
> >
> > Do you have "ro" or "rw" on the kernel cmdline?
>
> I have neither set on the cmdline.

if you add it, does it work?

ro/rw is a bit weird. Usually in our configuration model the settings
on the kernel cmdline args take precedence over config in
/etc/. But ro/rw is different for historical reasons: it only
specifies the initial ro/rw state of the disks, expecting that
/etc/fstab later changes things to the final setting. And if neither
are specified we imply "ro".

Hence, you have two choices: define an /etc/fstab (which of course is
not what you want with gpt-auto) or just add "rw" to the kernel cmdline.
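
The "neither means ro" rule can be sketched in a few lines of Python.
initial_root_mode() is a hypothetical helper for illustration only. It
splits naively on whitespace and assumes that the last switch wins if
both are present.

```python
def initial_root_mode(cmdline):
    """Return "ro" or "rw" for a kernel command line string.

    If neither switch is present, "ro" is implied, as described above.
    Assumption: if both appear, the last one wins. Quoting on the
    command line is not handled in this sketch.
    """
    mode = "ro"
    for word in cmdline.split():
        if word in ("ro", "rw"):
            mode = word
    return mode


implied = initial_root_mode("root=/dev/sda2 quiet")
explicit = initial_root_mode("root=/dev/sda2 rw quiet")
```

On a running system the input would come from /proc/cmdline.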

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to get Credential into Environment variable?

2023-10-24 Thread Lennart Poettering
On Di, 26.09.23 04:39, chandler (s...@riseup.net) wrote:

> Hi all,
>
>     I'm not quite grasping something here... I've just learned about
> `systemd-creds` and now trying to utilize it with a service which
> depends on a secret stored in an environment variable (or passed as a
> CLI option).
>
> Normally I could use a line like:
>
> `Environment=SEC=1234`
>
> Now I've:
>
> 1) Given "1234" to `systemd-ask-password -n | systemd-creds encrypt
> --name=secret --pretty - -`
> 2) Put the resulting `SetCredentialEncrypted=secret: ...` under the
> [Service] section
> 3) Failing with `Environment=SEC=%d/secret`
>
> Now `SEC=/run/credentials/myService.service/secret` but I need the value
> from the file, which I verified with a simple `ExecStart=checkEnv.sh`
> which runs `cat ${SEC}` which prints `1234`.
>
> Also tried putting the secret, e.g. "1234", into a temp file `/tmp/sec`
> and ran:
>
> `systemd-creds encrypt --name=secret --pretty /tmp/sec -`
>
> but the results are the same.
>
> How to get `SEC=1234` basically?

The credentials logic is supposed to be used *in* *place* of
environment variables. Environment variables are an awful way to pass
credentials to services, since they are inherited down the process
tree even across security boundaries by default, and there's zero
access control on them. (and they are not really compatible with
binary data or larger data, and so on)

Hence, what you are looking for is not supported and we won't support
it, because it defeats one main design goal of credentials: to require
access control on access, and not allow "greedy" inheritance down the
process tree.

Sorry if that's disappointing!

If a service insists on reading its credentials from an env var or
cmdline and supports nothing else this is of course disappointing, but
it's simply not compatible with the credentials logic, without manual
glue scripting.

We generally recommend that services support reading the credentials
from files (in which case you can point them to
$CREDENTIALS_DIRECTORY/ to read what they want), or even better:
just natively support credentials by looking at $CREDENTIALS_DIRECTORY
on their own, without being told so.

If you have an app that doesn't allow either, and really and only
wants env vars or cmdline params, then you can script around this,
with a script like this:

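```bash
#!/bin/bash

# Read the first line of the credential file into a shell variable.
read -r MYCRED < "$CREDENTIALS_DIRECTORY"/mycred
# Export it so the service binary sees it as an environment variable.
export MYCRED

# Replace the shell with the actual service binary.
exec mybinary
```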

you get the idea.
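
For comparison, the preferred variant, a service that looks at
$CREDENTIALS_DIRECTORY on its own, might look roughly like this in
Python. read_credential() is a hypothetical helper, and the demo part
fakes the directory the service manager would normally provide, so take
it as a sketch rather than a reference implementation.

```python
import os
import tempfile
from pathlib import Path


def read_credential(name):
    """Return the text of credential `name` from $CREDENTIALS_DIRECTORY."""
    directory = os.environ.get("CREDENTIALS_DIRECTORY")
    if directory is None:
        raise RuntimeError("CREDENTIALS_DIRECTORY is not set")
    return (Path(directory) / name).read_text().rstrip("\n")


# Demo only: fake the directory that the service manager would set up.
with tempfile.TemporaryDirectory() as fake_dir:
    (Path(fake_dir) / "mycred").write_text("1234\n")
    os.environ["CREDENTIALS_DIRECTORY"] = fake_dir
    secret = read_credential("mycred")
os.environ.pop("CREDENTIALS_DIRECTORY", None)
```

The secret then only ever lives in the memory of the process that asked
for it, and is not inherited by whatever that process spawns.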

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] How to use systemd-growfs* services with GPT automount

2023-10-24 Thread Lennart Poettering
On Mo, 23.10.23 02:00, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hello,
>
> I am not sure how to get systemd-growfs-root.service to work with
> automount. The partitions are configured via systemd-repart (and the
> image created using mkosi). While the partitions are correctly grown
> upon boot, the contained filesystem is not grown to match the
> partition even though GrowFileSystem defaults to true. Is there
> anything I am missing or an easy way to troubleshoot this and get more
> information?
>
> One thing I notice is that the generator.late/-.mount unit has a
> Options=ro which as per documentation prevents growing the filesystem.
> However, the filesystem is actually mounted read-write so I assume
> this is just an artifact of the initrd. Is it not possible to grow the
> filesystem from which the initrd starts?

Do you have "ro" or "rw" on the kernel cmdline?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Support for loading Multiple DTBs from UKI image

2023-10-11 Thread Lennart Poettering
On Mi, 11.10.23 10:00, VENKAT Nagaraj THOGARU (QUIC) (quic_thog...@quicinc.com) 
wrote:

> Hi @systemd-devel@lists.freedesktop.org,
>
>
> Is there any support for parsing Multiple DTBs and selecting appropriate DTB 
> from UKI image in system-boot?
> Or is there any UEFI interface hook to implement such a change in UEFI to 
> make a selection of DTB, just like DT_FIXUP ?

There's a PR for this:

https://github.com/systemd/systemd/pull/28959

But it hasn't seen progress in the past 3 weeks.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Help! Reached target Local File Systems order is incorrect

2023-10-10 Thread Lennart Poettering
On Mo, 09.10.23 12:07, Tony Rodriguez (unixpro1...@gmail.com) wrote:

> Created a service that invokes a "systemctl daemon-reload". Goal is for a
> reload to occur early in the boot process, before other user made services
> are invoked.  During additional testing, sometimes it is correct and other
> times it is out of order (incorrect -  See steps C).  It may work for 5 or 6
> times after each reboot/shutdown, then randomly become incorrect. How can I
> make this more consistent? Already tried various keyword combinations
> (wants,before,after, etc) within the unit file, all without luck.
> Thought about something like "After=var.mount, etc" as well, but that is
> inflexible because I will not know filesystems users may create.
>
> A) Unit file
>
> [Unit]
> Description=Systemctl-Reload
> Wants=local-fs.target
> DefaultDependencies=yes
>
> [Service]
> Type=oneshot
> RemainAfterExit=yes
> ExecStart=/bin/systemctl daemon-reload
>
> [Install]
> WantedBy=local-fs.target
>
> B)  Correct order: ** Reached target Local File Systems is after all
> mounting is done. Sometimes it works.

You have not defined any order in the unit file. i.e. neither After= nor
Before=. Hence it's going to be executed as quickly as possible during
boot.

See docs:

https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before=

Generally though it's recommended not to reload PID 1 configuration
during the initial transaction if avoidable. Better approaches are to
put together generators or so, which can augment the set of units and
their dependencies already when the first transaction is put together.

https://www.freedesktop.org/software/systemd/man/systemd.generator.html

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lennart Poettering
On Fr, 29.09.23 10:53, Lewis Gaul (lewis.g...@gmail.com) wrote:

> Hi systemd team,
>
> I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
> container's namespaces (although not a child of the container's PID
> 1).

Yeah, joining into a container is really weird, it makes a process
appear from nowhere, possibly blocking resources, outside of the
resource or lifecycle control of the code in the container, outside of
any security restrictions and so on.

I personally think joining a container via joining the namespaces
(i.e. podman exec) might be OK for debugging, but it's not a good
default workflow. Unfortunately the problems with the approach are not
well understood by the container people.

In systemd's own container logic (i.e. systemd-nspawn + machinectl) we
hence avoid doing anything like this. "machinectl shell" and
related commands will instead talk to PID 1 in the container and ask it
to spawn something off, rather than doing so yourself.

Kinda related to this: util-linux' "unshare" tool (which can be used
to generically enter a container like this) also is pretty broken in
this regard btw, and I asked them to fix that, but nothing happened
there yet:

https://github.com/util-linux/util-linux/issues/2006

I'd advise "podman" and these things to never place joined processes
in the root cgroup of the container if they delegate cgroup access to
the container, because that really defeats the point. Instead they
should always join the cgroup of PID 1 in the container (which they
might already do I think), and if PID 1 is in the root cgroup, then
they should create their own subcgroup "/joined" or so, and put the
process in there, to not collide with the "no processes in inner
groups" rule of cgroupv2.
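
The placement rule suggested above can be written down as a short
Python sketch. Both functions are hypothetical helpers. The input is
the text of /proc/1/cgroup as seen from inside the container, and the
"/joined" name is just the placeholder used above.

```python
def pid1_cgroup(proc_cgroup_text):
    """Extract the cgroup v2 path from the contents of /proc/1/cgroup.

    On the unified hierarchy the relevant line has the form "0::<path>".
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy, _, rest = line.partition(":")
        controllers, _, path = rest.partition(":")
        if hierarchy == "0" and controllers == "":
            return path
    raise ValueError("no cgroup v2 entry found")


def cgroup_for_joined_process(proc_cgroup_text, fallback="/joined"):
    """Pick the cgroup a joined process should be placed in.

    Follow PID 1's cgroup, but never use the root cgroup itself:
    fall back to a dedicated subgroup in that case.
    """
    path = pid1_cgroup(proc_cgroup_text)
    return fallback if path == "/" else path


early = cgroup_for_joined_process("0::/\n")            # PID 1 still in root
later = cgroup_for_joined_process("0::/init.scope\n")  # PID 1 has moved
```

This says nothing about how the runtime would create the fallback
group, it only shows the decision.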

> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>
> More precisely, the problem manifests as follows (in
> https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676
> ):
> - Container exec processes are placed in the container's root cgroup by
> default, but if this fails (due to the "no internal processes" rule) then
> container PID 1's cgroup is used (see
> https://github.com/opencontainers/runc/issues/2356).

This is a really bad idea. At the very least the rule should be
reversed (which would still be racy, but certainly better). But as
mentioned they should never put something in the root cgroup if cgroup
delegation is on.

> - At systemd startup, systemd tries to create the init.scope cgroup and
> move all processes into it.
> - If a container exec process is created after finding procs to move and
> moving them but before enabling controllers then the exec process will be
> placed in the root cgroup.
> - When systemd then tries to enable controllers via subtree_control in the
> container's root cgroup, this fails because the exec process is in that
> cgroup.
>
> The root of the problem here is that moving processes out of a cgroup and
> enabling controllers (such that new processes cannot be created there) is
> not an atomic operation, meaning there's a window where a new process can
> get in the way. One possible solution/workaround in systemd would be to
> retry under this condition. Or perhaps this should be considered a bug in
> the container runtimes?

Yes, that's what I think. They should fix that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Is systemd-cryptsetup binary internal?

2023-09-18 Thread Lennart Poettering
On Mo, 18.09.23 17:47, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hi,
>
> /usr/lib/systemd/ is indeed the place for internal binaries with
> > unstable interfaces. But it's also the place where we put binaries
> > that we don't typically expect users to call, because they are
> > generally called via some well defined .service unit or so only.
> >
> > systemd-cryptsetup is one of the latter, we'd expect people to use
> > this via crypttab mostly. However, the interface is nonetheless
> > stable, it is a long-time part of systemd and so far we never broke
> > interface and I see no reason we ever would. In fact it might be a
> > candidate to move over to /usr/bin to make official, if there's
> > sufficient request for that. (such a request should be made via github
> > issue tracker)
> >
>
> Why was the decision taken to put these into /usr/lib/systemd instead of
> /usr/libexec/systemd/?

That's a Fedoraism. Why would one put something there?

/usr/lib/ is where private arch-dependent package stuff goes. What's
the rationale for /usr/libexec/ though?

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Is systemd-cryptsetup binary internal?

2023-09-18 Thread Lennart Poettering
On Mo, 18.09.23 15:22, mpan (systemdml-bfok4...@mpan.pl) wrote:

> Hello,
>
>   I got redirected to here from #systemd on Libera. While responding to a
> query from another person (not on #systemd), I came across an ambiguity. Any
> answer I give, its validity would be uncertain. I wish to receive an
> authoritative clarification.
>
>   There is systemd-cryptsetup binary in “/usr/lib/systemd/”. Its location
> suggests it’s internal to systemd and not intended for user invocation.
> However, it is also listed in manual as if it was something the user might
> be concerned with. The manual even has a specific, separate, explicit
> reference to systemd-cryptsetup page — though it’s shared with the
> corresponding service and the binary itself isn’t described.

/usr/lib/systemd/ is indeed the place for internal binaries with
unstable interfaces. But it's also the place where we put binaries
that we don't typically expect users to call, because they are
generally called via some well defined .service unit or so only.

systemd-cryptsetup is one of the latter, we'd expect people to use
this via crypttab mostly. However, the interface is nonetheless
stable, it is a long-time part of systemd and so far we never broke
interface and I see no reason we ever would. In fact it might be a
candidate to move over to /usr/bin to make official, if there's
sufficient request for that. (such a request should be made via github
issue tracker)

>   Thanks in advance for indicating, if systemd-cryptsetup (the binary) is a
> tool users may rely on.

Yes, absolutely.

The only case where we might break things for you is when we one day
move it from /usr/lib to /usr/bin, ;-)

Hence: the call interface is certainly stable, the location in that
sense maybe not yet.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] DynamicUser=yes leads to "Too many levels of symbolic links" for /etc/.pwd.lock

2023-09-14 Thread Lennart Poettering
On Do, 14.09.23 03:50, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote:

> $ ls -l /etc/.pwd.lock
>
> lrwxrwxrwx 1 root root 19 Apr  5  2011 /etc/.pwd.lock -> sysconfig/.pwd.lock
>
> $ ls -l /etc/sysconfig/.pwd.lock
>
> -rw--- 1 root root 0 Aug 16 07:25 /etc/sysconfig/.pwd.lock
>
> For the purpose of investigation, I configured an overlay so /etc/.pwd.lock 
> was a simple writeable file (not a read-only symlink) and the service starts.
>
> Why is systemd complaining about the file being a symlink?

It's supposed to be a lock file, i.e. a regular file we issue POSIX file
locks on. It's not a config file.

The problem with symlinks for things like this is that in various
contexts these files are atomically replaced, and if that happens then
symlinks just make a mess, since it's not clear whether to replace the
symlink or its target.

Hence, we don't support that.

Generally, things like /etc/passwd are pretty much API, you cannot
really change them to be a symlink (unless you make it fully
immutable), since it is updated by various tools and these tools tend
to do atomic updates of these files, i.e. when updating they write a
new file under a temporary name/O_TMPFILE, and then atomically move
it over the old file, so that other clients either get the old version
or the new version but never a half-updated version. This kind of
updating is really how you have to do things on UNIX, but that means
symlinks are out of the question...
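
A minimal Python sketch of that update pattern, and of what it does to
a symlink, assuming everything lives on one file system. atomic_write()
is a hypothetical helper, and the file names are made up for the demo.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see old or new content only."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be in the same directory (same file system).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # rename(2): atomic within one file system
    except BaseException:
        os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "real-file")
    link = os.path.join(d, "passwd")
    with open(target, "wb") as f:
        f.write(b"old\n")
    os.symlink("real-file", link)

    atomic_write(link, b"new\n")

    link_survived = os.path.islink(link)
    with open(target, "rb") as f:
        target_content = f.read()
    with open(link, "rb") as f:
        link_content = f.read()
```

After the update the symlink is gone: the name now refers to a regular
file with the new content, while the former target still holds the old
content. A tool that resolved the symlink first would behave the other
way around, which is exactly the ambiguity.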

Hence, TLDR: don't make the lock file a symlink. (Also, why would you even?)

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Fedora 38 and signed PCR binding

2023-09-11 Thread Lennart Poettering
On Mo, 11.09.23 14:48, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> Hi again. I tried to boot from UKI to no avail.
>
> First created a "db" certificate
> > openssl req -newkey rsa:2048 -nodes -keyout db_arch.key -new -x509 -sha256 
> > -days 3650 -subj "/CN=My DB cert/" -out db.pem
> > openssl x509 -outform DER -in db.pem -out db.crt
>
> Then uploaded it to secure boot trust via USB drive and the UEFI setup.
>
> Then created UKI:
> >   /usr/lib/systemd/ukify \
> > /lib/modules/6.4.12-200.fc38.x86_64/vmlinuz \
> > /boot/initramfs-6.4.12-200.fc38.x86_64.img \
> > --pcr-private-key=/etc/systemd/tpm2-pcr-private-key.pem \
> > --pcr-public-key=/etc/systemd/tpm2-pcr-public-key.pem \
> > --phases='enter-initrd' \
> > --pcr-banks=sha1,sha256 \
> > --secureboot-private-key=/etc/secure_boot/db.key \
> > --secureboot-certificate=/etc/secure_boot/db.pem \
> > --sign-kernel \
> > --cmdline='ro rhgb'
>
> Then added a boot entry:
> > efibootmgr -c -d /dev/sda -p 1 -l /EFI/FEDORA/UKI/VMLINUZ612.EFI -L "Fedora 
> > UKI"
>
> Unfortunately when trying to boot this I get:
> > Bad kernel image: Load Error

That suggests the kernel you picked does not carry a correct PE/MZ
signature. i.e. we generate that error typically if we can't jump into
it because it doesn't come with the right PE headers.
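
A quick way to sanity-check an image for those headers could look like
the following Python sketch. It only tests the two magic values ("MZ"
at offset 0, and "PE\0\0" at the offset stored at 0x3c), which is far
less than what the firmware or the boot stub verify, and looks_like_pe()
is a hypothetical helper.

```python
import struct


def looks_like_pe(data):
    """Return True if `data` starts with an MZ header pointing at a PE header."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # e_lfanew: little-endian 32-bit offset of the PE header, stored at 0x3c.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\0\0"


# Synthetic inputs for illustration; a real check would read the kernel image.
fake_pe = bytearray(0x80)
fake_pe[0:2] = b"MZ"
struct.pack_into("<I", fake_pe, 0x3C, 0x40)
fake_pe[0x40:0x44] = b"PE\0\0"

good = looks_like_pe(bytes(fake_pe))
bad = looks_like_pe(b"\x7fELF" + bytes(0x40))
```

To check a real file, pass in the first few KiB of the vmlinuz you
handed to ukify.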

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-repart /etc automount via discoverable partition specification

2023-09-11 Thread Lennart Poettering
On Mo, 11.09.23 11:39, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> On Mon, Sep 11, 2023, 10:54 Lennart Poettering 
> wrote:
>
> > On So, 10.09.23 00:33, Nils Kattenbeck (nilskem...@gmail.com) wrote:
> >
> > > Hello, I am currently trying to build a linux image with discoverable
> > > partitions in an A/B+etc+var scheme.
> >
> > The discoverable partition scheme has no concept of /etc/ discovery. It
> > focusses on three basic setups:
> >
> > 1. writable root fs that contains /etc/, /var/ and /usr/ directly.
> > 2. writable root fs that contains /etc/ and /var/ and gets an
> >immutable /usr/ mounted in
> > 3. immutable root fs that contains /etc/ and /usr/ directly and gets a
> >writable /var/ mounted in. (the latter possibly as tmpfs, for truly
> >stateless systems)
>
> There is also 4. with a writeable root which only contains /etc, an
> immutable /usr and a temporary /var. Though I guess that can be covered
> with the existing DPS...?

That's pretty much the same as 2, except that /var is overmounted with
a tmpfs. i.e. you would simply place /etc/fstab in there, that says
/var is tmpfs.

> > It was our assumption that these three cases should cover most
> > intended behaviours nicely, i.e. systems with modifiable config, code
> > and state. systems with modifiable config and state, but immutable
> > code. And finally systems with immutable config and code, but
> > modifiable state.
> >
> > A system where /etc/ was separate from the root fs is not covered by
> > the above, because it is not clear what that would get us. if you want
> > it immutable, why not stick it on an immutable root fs. And if you
> > want it writable, why not stick it on a writable root fs directly?
>
> My use case is basically 2, /etc has to be writeable to persist the
> machine-id across reboots, /var also has to be writeable and /usr can be
> immutable.
>
> The problem I am then likely facing is that I create the partitions wrong.
> I am using mkosi and tried several different repart.d configuration with
> type=root+type=usr, type=root+type=var+type=use, and different CopyFiles=
> and Exclude(Target)Files= but none of them seemed to have worked.

If your /var/ is supposed to be a tmpfs, then don't mention it to
mkosi/repart, just put an /etc/fstab into place that dictates /var is
mounted as tmpfs.

Other than that you should just be able to use Type=root and Type=usr then.
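
A minimal sketch of what the two repart.d definitions might look like.
The file names, the Format= choices and the exact CopyFiles= and
ExcludeFiles= values are assumptions for illustration, not a tested
mkosi setup, and ExcludeFiles= needs a sufficiently recent
systemd-repart:

```
# 10-root.conf (hypothetical file name): writable root fs with /etc/ and /var/
[Partition]
Type=root
Format=ext4
CopyFiles=/
# Trailing slash: leave out the contents of /usr/, keep the mount point
ExcludeFiles=/usr/

# 20-usr.conf (hypothetical file name): immutable /usr/
[Partition]
Type=usr
Format=erofs
CopyFiles=/usr/:/
```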

> Are there special requirements for what the respective partitions must or
> shall not contain when using several auto-discovered partitions? Or should
> I ask on the mkosi issue tracker?

If you have just root + usr then this should be a pretty common
situation for mkosi, it's not special and should just work.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-repart /etc automount via discoverable partition specification

2023-09-11 Thread Lennart Poettering
On So, 10.09.23 00:33, Nils Kattenbeck (nilskem...@gmail.com) wrote:

> Hello, I am currently trying to build a linux image with discoverable
> partitions in an A/B+etc+var scheme.

The discoverable partition scheme has no concept of /etc/ discovery. It
focusses on three basic setups:

1. writable root fs that contains /etc/, /var/ and /usr/ directly.
2. writable root fs that contains /etc/ and /var/ and gets an
   immutable /usr/ mounted in
3. immutable root fs that contains /etc/ and /usr/ directly and gets a
   writable /var/ mounted in. (the latter possibly as tmpfs, for truly
   stateless systems)

It was our assumption that these three cases should cover most
intended behaviours nicely, i.e. systems with modifiable config, code
and state. systems with modifiable config and state, but immutable
code. And finally systems with immutable config and code, but
modifiable state.

A system where /etc/ was separate from the root fs is not covered by
the above, because it is not clear what that would get us. if you want
it immutable, why not stick it on an immutable root fs. And if you
want it writable, why not stick it on a writable root fs directly?

The design of saying "/etc/ is always part of the rootfs" is also
reflecting the fact that /etc/fstab is the map of secondary file
systems to mount, i.e. it generally contains references to other file
systems that take precedence over the discoverable partition spec, and
hence it is crucial that we place it on the first item in the chain so
that we can take it into account before looking for other items in the
chain.

> I know that /usr and /var have a
> corresponding partition UUID for automatically mounting them as per
> DPS. However, I am not sure how to mount the /etc partition? Do I have
> to specify it as the root partition and exclude /usr and /var in it?
> Any help would be appreciated.

If you want /etc/ split off, then the discoverable partition spec
won't help you: you have to mount it explicitly from your initrd.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Documentation question about sd-device

2023-09-11 Thread Lennart Poettering
On So, 10.09.23 11:45, CARLOS VILLANUEVA Y SIMON (carvi...@gmail.com) wrote:

> Hello all,
>
> Recently, while updating a code that was using libudev to a more modern
> API, sd-device, I could not find some of the functions that are defined at
> sd-device.h (
> https://github.com/systemd/systemd/blob/main/src/systemd/sd-device.h) at
> the man pages (
> https://www.freedesktop.org/software/systemd/man/sd-device.html), e.g.
> functions related with sd_device_monitor among others.
>
> May I be right or am I missing something?

Yeah, the documentation is not entirely complete yet. Sorry!

It's such a thankless job! But it's definitely on our TODO list.

If you can't guess how things work from the header, let us know, we
can provide you here with the necessary info to get things off the
ground.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Fedora 38 and signed PCR binding

2023-09-05 Thread Lennart Poettering
On Sa, 02.09.23 22:22, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> Looking at the PR [1] it looks like I need to do a lot of things at
> each update manually. Is the thing in the comment the only thing I
> need to do or are there other things as well?

There's nowadays "ukify" that does all of this for you in one
relatively easy step, it's our recommended approach to building UKIs
these days.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Fedora 38 and signed PCR binding

2023-09-05 Thread Lennart Poettering
On Sa, 02.09.23 22:18, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> Hello,
>
> Trying to configure Signed PCR binding on Fedora 38 by following
> article [1] and adapting commands for signing.
>
> What I did was basically this:
> > openssl genrsa -out /etc/systemd/tpm2-pcr-private-key.pem 2048
> > openssl rsa -in /etc/systemd/tpm2-pcr-private-key.pem -pubout -out 
> > /etc/systemd/tpm2-pcr-public-key.pem
> > sudo systemd-cryptenroll --tpm2-device=auto 
> > --tpm2-public-key-pcrs=7+9+11+12+13+14+15 /dev/sda3
> > added tpm2-device=auto,tpm2-pcrs=7+9+11+12+13+14+15
>
> But automatic unlocking does *not* work. And This is what
> systemd-measure returns:
>
> $ /usr/lib/systemd/systemd-measure status
> Warning: current kernel image does not support measuring itself, the
> command line or initrd system extension images.
> The PCR measurements seen are unlikely to be valid.
> # PCR[11] Unified Kernel Image (NOT SET!)
> 11:sha1=
> 11:sha256=
> # PCR[12] Kernel Parameters (NOT SET!)
> 12:sha1=
> 12:sha256=
> # PCR[13] initrd System Extensions (NOT SET!)
> 13:sha1=
> 13:sha256=
>
> Did I do something wrong? Is just necessary integration missing from
> Fedora 38 so I better revert to normal PCR binding?

Is your kernel built with sd-stub glued in front of it? i.e. did you
use ukify?

Note that Fedora still uses a legacy boot path with grub and
traditional kernels, instead of sd-boot/sd-stub and UKIs. PCR
measurements are messy there, and the PCR signature stuff as
implemented in systemd-measure doesn't work there.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Online backup API for systemd-journal?

2023-09-05 Thread Lennart Poettering
On Mo, 04.09.23 16:35, Etienne Doms (etienne.d...@gmail.com) wrote:

> Hi,
>
> I have some embedded systems in the wild, not connected to anything,
> on which you can push a button "something went wrong, create a dump".
> Then later I can fetch the said dump and inspect it.
>
> I'd like to include the whole journal, for the current boot, in a
> binary format so that I can later do "journalctl --file
> path/to/journal-dump.bin" from another machine. I understand that
> internally everything is stored in /var/log/journal/, but
> I guess that I cannot blindly tar/cp the .journal files, since this
> would be racy.

That should actually work fine. journald has no locking around journal
files: the server that writes to the files and the client that reads
them are not synchronized. The client is supposed to handle incomplete
writes by simply suppressing display of the trailing, incomplete
entries. This is a common code path, that is quite well tested these
days.

Hence, it should actually be fine to just copy the journal files as
they are being written. The tools on the other side will possibly then
see a file with records currently "in flight" that are referenced at
some places but not others, but that should be totally OK, the tools
should handle this, and this is no different from their local access.

> So, is there an API to safely dump a big ".journal" file containing a
> snapshot of "journalctl -b"? I could not find anything in the
> documentation, sorry in advance if I missed something obvious.

You can use "-o export" to dump the files in an "export" format. But
this is just about returning the data in a different format, it does
not give you any synchronization guarantees, since journalctl started
that way will just read the data from the journal files unsynchronized,
as everyone else does too.
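
A minimal sketch of the plain-copy approach described above. The
function name and all paths are assumptions, and it copies every
journal file in the directory rather than only the current boot:

```shell
# Copy all *.journal files from a source directory into a dump directory.
# No locking is attempted; readers are expected to cope with trailing,
# incomplete entries.
snapshot_journal() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    cp -- "$src"/*.journal "$dest"/
}

# Example (paths are assumptions, adjust to your system):
#   snapshot_journal "/var/log/journal/$(cat /etc/machine-id)" /tmp/journal-dump
# Later, on another machine:
#   journalctl --directory=/tmp/journal-dump --list-boots
```

Pointing journalctl at the copied directory rather than at a single
file lets it interleave the system and user journals as usual.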

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [multiseat] Attach virtual input to seat1

2023-09-05 Thread Lennart Poettering
On So, 03.09.23 00:46, LuKaRo (li...@lrose.de) wrote:

>
> $ sudo loginctl attach seat1 /sys/devices/virtual/input/input43
> Could not attach device: No such device
> $ sudo loginctl attach seat1 /sys/devices/virtual/input/input44
> Could not attach device: No such device
> $ sudo loginctl attach seat1 /sys/devices/virtual/input/input23
> Could not attach device: No such device
> $ sudo loginctl attach seat1 /sys/devices/virtual/input/input22
> Could not attach device: No such device
> $ sudo loginctl attach seat1 /sys/devices/virtual/input/input21
> Could not attach device: No such device
>
> Any idea why all of them fail, and what could be a possible
> workaround?

See my reply here:

https://lists.freedesktop.org/archives/systemd-devel/2023-September/049470.html

The key is that the udev property ID_FOR_SEAT is not set for these
devices. (We should definitely generate a more useful error in that
case.) Only devices that have that property set can be assigned to seats.

ID_FOR_SEAT is supposed to carry some form of stable ID string we can
identify the device with, that remains the same between reboots. We
currently set ID_FOR_SEAT to useful values for PCI and USB devices,
but not on other busses. In particular virtual devices are not
covered. "input23" is not useful as an identifier string in
ID_FOR_SEAT, because they are assigned in the order of probing, which
typically is not stable.

It should suffice to set the udev property via some udev rule to
something reasonable, for the devices you add... I have no idea what
that would look like for your specific type of devices.
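
Purely as an illustration, such a rule might look something like the
following. The file name, the device name being matched and the
identifier string are all made up, and matching on the name is only
stable as long as that name is unique on your system:

```
# /etc/udev/rules.d/71-virtual-input-seat.rules  (hypothetical file name)
# Give one specific virtual input device a stable identifier for seat mgmt.
SUBSYSTEM=="input", KERNEL=="input*", ATTR{name}=="My Virtual Keyboard", ENV{ID_FOR_SEAT}="input-virtual-my_keyboard"
```

Afterwards "udevadm info /sys/devices/virtual/input/inputNN" should
list the ID_FOR_SEAT property for the device, once the rules have been
reloaded and the device re-triggered.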

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [multiseat] How to make automatic ACL creation via udev "uaccess" tag work for seats other than seat0?

2023-09-01 Thread Lennart Poettering
On Fr, 01.09.23 13:13, Christian Pernegger (perneg...@gmail.com) wrote:

> Of course, if you want to take the position that it's a bit weird for
> GNOME to use /dev/rfkill to detect the presence of BT devices, I can't
> argue against that. :)

Doesn't NM/bluez manage these things from privileged code anyway? Is
this really done from inside the GNOME UI with direct device access?

> (From a use case perspective, it would be nice if paired BT devices
> could somehow be tagged. I.e. so that each seat can pair devices and
> manage them, but not see or manage ones paired by other seats and/or
> users.)

Yeah, it would be great if bluez would gain native multi-seat support,
i.e. that it tracks seat assignments for paired devices. But that's
something to request from bluez upstream, not systemd.

> > You cannot attach devices to multiple seats.
>
> Roger that. Is there a way to exempt devices from the multiseat
> mechanism, though? Mark them not seat-specific? Or is that
> hard-coded?

Change the udev rules to not set the "seat" udev tag on the
relevant device. That's what decides whether seat mgmt is done for the
device.

> > You should be able to assign the device to a different seat though.
>
> systemctl attach won't let me, at least not using the path seat-status
> spits out. But I'm sure the version of systemd in Ubuntu 22.04 is
> ancient, and/or they may have done something to it. If you like, I can
> try whether adding a udev rule manually works, but personally I'm not
> too bothered about this particular issue.

So the problem is that the rfkill device does not carry the
ID_FOR_SEAT property right now, we only add that for pci/usb/…
devices, i.e. the usual busses. rfkill being a virtual device doesn't
carry that property.

That property carries the string identifier that we should use for
identifying the device for seat mgmt purposes. It's usually derived
from the path ID of the device.

To make rfkill manageable via "loginctl attach" would mean adding such
a property there. Happy to take a patch for that, please submit via
github.

> > that's how things work and people assume them to work: graphics render
> > services are used to bring stuff to screen.
>
> I don't know about this. Yes, seat1 could hog the GPU that seat0's
> outputs are attached to, or vice versa, but seat1 could just as well
> hog all the RAM or saturate the CPU. My point being, seats share the
> host's CPU power, RAM, ..., already, why not the rendering/compute
> power as well. IMVHO it's really just inputs and outputs that should
> be seat-specific. Restricting the shared resources available to a
> given seat, allocating them fairly, etc., is a different problem (and
> arguably one that I'd tackle per user and not per seat).

CPU/RAM are by default resource managed, i.e. each user logged in gets
a similar amount under pressure, as controlled via the cgroups
logic.

This is different from GPU resources, there's no such resource
management for that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] [multiseat] How to make automatic ACL creation via udev "uaccess" tag work for seats other than seat0?

2023-09-01 Thread Lennart Poettering
On Fr, 01.09.23 02:02, Christian Pernegger (perneg...@gmail.com) wrote:

> Am Do., 31. Aug. 2023 um 21:55 Uhr schrieb Andrei Borzenkov
> :
> >
> > On 31.08.2023 19:22, Christian Pernegger wrote:
> > There is no ID_SEAT, so this device [/dev/rfkill] ]belongs to seat0 by 
> > default.
>
> It makes no sense for /dev/rfkill to belong to a specific seat,
> though.

Typically any RF kill buttons are attached to the main seat of a
laptop only, hence this assignment.

> GNOME at least assumes the user to have write access.
> Note that while /sys/devices/virtual/misc/rfkill shows up in the
> output of loginctl seat-status it cannot be attached to another seat
> ("Could not attach device: No such device").

You cannot attach devices to multiple seats. You should be able to
assign the device to a different seat though.

> Or what about /dev/kvm? Why should only seat0 have the ability to use
> KVM? (It can't be attached to other seats, either.)

/dev/kvm is 0666 by default, except on some distros that depart from
that. Please contact them for help on how they intend to manage access
there, but the uaccess logic is not it.

> The dri/renderD??? device is automatically attached to the seat that
> the dri/card? one is attached to (even though it isn't a child
> according to the seat-status tree--funnily enough this does not happen
> for the fb? device).

fb is obsolete. fb devices are still assigned to seats but no unpriv
access is granted.

> It makes sense that the rendering bits of a card should "belong to"
> the seat that has the outputs, the problem is that this renders it
> inaccessible to the other seats, which it shouldn't. A seat can
> access another seat's *rendering capabilities* just fine as long as
> the permissions are set correctly.

Well, you can do lots of things. We ship defaults only. Feel free to
write udev rules that assign things to whatever you want them to be
assigned.

By default render devices are only accessible to local users on the
seat they are logged on to, not everyone else, since typically
resources on a graphics card are bounded, and it makes sense to give
access to users who also get access to the screen, because typically
that's how things work and people assume them to work: graphics render
services are used to bring stuff to screen. There's also a "render"
group set up to which users can be added which should always get
access.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Custom Localed Configuration Location

2023-08-31 Thread Lennart Poettering
On Di, 29.08.23 17:18, TJ Shipp (onezoo...@msn.com) wrote:

> I am trying to create a system where we can change locale on a
> running system (where we would have daemons subscribe to dbus and
> get the properties changed messages) but need to be able to change
> the location of the locale file (by default in /etc/locale.conf) as
> /etc is read-only on our system.

We do not support that. /etc/ is the place for configuration on Linux,
and if you make that immutable you basically turn off the ability to
configure things at runtime. Which is totally OK to do of course, but
if this is the mode you pick you shouldn't be surprised that this is
what you get.

> Is there a way to change the file location to a writeable location
> as I can not find any current means to do such?

This is not configurable, the path /etc/locale.conf is considered
API. It's not a hidden backend or so, but a primary interface to this
setting.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Additional Locale Variables for Units and Number Format

2023-08-31 Thread Lennart Poettering
On Di, 29.08.23 17:17, TJ Shipp (onezoo...@msn.com) wrote:

> I am trying to add in support for a separate variable to change our unit 
> system, and having both LANG and UNITS to identify the "locale" of the system.
> We are also not only looking for English versus Metric, but are looking for 
> mixed units as well (both Imperial and Metric hybrid), as well as looking to 
> add number formats (1,000.00 vs 1.000,00)
>
> And what is the best way to add support for a new system environment variable 
> such as UNITS?
>
> P.S. If anyone is interested in contracting to do this work, please send me a 
> private message outside this list.

systemd-devel is not the right forum for this. Not sure what a better
forum for this is, but systemd is way too low-level system stuff for
that.

Hence, I don't know who to suggest you to contact about this, but
maybe someone at the Linux Foundation can connect you.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Flushing DNS caches items on clock change.

2023-08-31 Thread Lennart Poettering
On Mi, 30.08.23 18:06, Vishwanath Chandapur (vishwa...@gmail.com) wrote:

> Hi,
>
> We are using systemd-resolved. We observed that on clock change,
> systemd-resolved is flushing all caches.
>
> By looking into the code we found that this is implemented primarily for
> DNSSEC.
>
> Is there any specific reason for flushing the other cache items like mDNS,
> LLMNR?

Usually things like clock jumps happen on system suspend/resume
cycles, VM migration and other VM management non-linearities. Quite
often this coincides with network connectivity changes, and hence we
should invalidate whatever information we collected so far about the
network.

Given this is redundant info we can reacquire, this should not be an issue.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Assertion '!ether_addr_is_null(addr)'

2023-08-31 Thread Lennart Poettering
On Mi, 30.08.23 15:22, Mirza Krak (mirza.k...@gmail.com) wrote:

> Hi,
>
> Environment:
> * systemd: 250.5

This release is from 2021, i.e. relatively old. The issue you are
describing is almost certainly already addressed in newer
versions. Consider using a newer version. Or contact your OS vendor,
asking them to maybe backport the fix in question.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Is it possible to change the cgroup uid/gid for a systemd slice?

2023-08-31 Thread Lennart Poettering
On Mi, 30.08.23 23:08, Julio Lajara (julio.laj...@protonmail.com) wrote:

> Hi all, I have created a systemd slice to constrain CPU/mem
> resources for a service unit. The service unit runs as root (its a
> bash script) and it runs a subprocess using systemd-run that it also
> runs under the same slice but a different unprivileged user. The
> subprocess needs to read the cgroup memory data directly from the
> sysfs tree but it cant because its owned by root.

sysfs tree? You mean cgroupfs tree?

But the memory attributes are world readable, so no need to chown.
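
A small sketch of how an unprivileged process could locate and read
its own cgroup's memory attributes. It assumes the unified (cgroup v2)
hierarchy mounted at /sys/fs/cgroup; the function name is made up:

```shell
# Print the cgroupfs directory for the cgroup listed in the given file
# (defaults to /proc/self/cgroup). On cgroup v2 that file contains a
# single line of the form "0::/some/path".
cgroup_dir() {
    path=$(sed -n 's/^0:://p' "${1:-/proc/self/cgroup}")
    printf '/sys/fs/cgroup%s\n' "$path"
}

# Example: read current memory usage without any chown:
#   cat "$(cgroup_dir)/memory.current"
```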

> Is there way I can change the permissions on it in the slice similar
> to how cgcreate has the -a option to set the uid/gid for the cgroup?

There's not. chowning of cgroups is pretty much about the ability to
change them or create subgroups in them, but we do not allow either to
client programs for slices.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Why are the priorities of stdout and stderr the same

2023-08-29 Thread Lennart Poettering
On Di, 29.08.23 11:56, Cecil Westerhof (cldwester...@gmail.com) wrote:
> > I agree with that usecase, and we have discussed this many times
> > before, but we couldn't come up with a nice way to make everything
> > work: proper ordering and distinction of stdout/stderr.
>
> I agree that the default behaviour is the right one. But why not give
> people the possibility to override this behaviour? When they override it
> themselves, they cannot complain that they lose ordering.

We are generally conservative when providing mechanisms that are too
glaringly broken. Even if they are opt-in.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Why are the priorities of stdout and stderr the same

2023-08-29 Thread Lennart Poettering
On Di, 29.08.23 11:20, Cecil Westerhof (cldwester...@gmail.com) wrote:

> Also: everything has a timestamp, so there is in my opinion when you choose
> to take them apart no big problem.

For stream connections like those used for stdout/stderr, lines do not
come with timestamps. We add them on the reception side, which is too late.

> > > To get what is send to stderr I had to do:
> > > journalctl -p 6 -u aptCacheUsage.service
> > >
> > > which gave beside a lot of other things the things send to stdout.
> > >
> > > Now I have two different statements I can do:
> > > journalctl -p 3 -u aptCacheUsage.service
> > >
> > > But it would be nice if I did not need two different statements (and the
> > > logic around that) for that.
> >
> > Still not getting what you are trying to say here.
> >
>
> Often I am only interested in what is sent to stderr and do not want what
> is sent to stdout. When both have the same log level I can not really
> filter on messages sent to stderr. At the moment I want to see the messages
> sent to stderr, I will also get the messages sent to stdout because they
> have the same error level.

I agree with that usecase, and we have discussed this many times
before, but we couldn't come up with a nice way to make everything
work: proper ordering and distinction of stdout/stderr.

The closest I came was using two distinct SOCK_DGRAM sockets
connect()ed to the same target socket (instead of the current approach
of using SOCK_STREAM). This would give us two benefits: for each
delivered datagram we would get a source socket address reported to us,
which would tell us which of the two source sockets it came from, and
hence whether it was stdout or stderr. Moreover, we would get a
kernel-supplied timestamp on each datagram if we want. This however
has a fairly big problem too: if programs write too much data into
their stdout/stderr at once they would get EMSGSIZE back, which
programs generally don't expect (i.e. if write()'s size is larger than
the datagram max size you get EMSGSIZE). Programs trying to write too
much usually expect blocking behaviour... Thus this approach is not
really an option.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Why are the priorities of stdout and stderr the same

2023-08-29 Thread Lennart Poettering
On Sa, 26.08.23 06:14, Cecil Westerhof (cldwester...@gmail.com) wrote:

Please keep mails like this on the mailing list.

> > We should not "assume the worst", hence given that the stderr stream
> > is typically used for all kinds of informational messages we should
> > not always assume its an error, because quite often its just
> > informational.
> >
>
> You have a very good point. When tcl opens a process for reading, it is an
> error when there is something to read on stderr, except when you overrule
> it. But that you can overrule it proves your point.
>
> > Hence, we use LOG_INFO if we have no clue simply because that's the
> > "best assumption".
> >
>
> I agree, but I would suggest a very simple solution.
> There is SyslogLevel which sets the syslog level for stdout and stderr. I
> would suggest adding SyslogLevelStderr. SyslogLevel would still set it for
> both except when there is also SyslogLevelStderr.

When journal redirection of both stdout + stderr is enabled for
systemd services we'll connect a single pipe to both fds, in order to
guarantee ordering, i.e. ensure that if something is written to
stdout, and then something to stderr, we'll definitely process it in
this order too. This however means, that on the receiving side we
cannot distinguish stdout/stderr anymore, it's all one stream. Hence
we can only choose between: guarantee correct ordering OR ability to
distinguish stdout/stderr. We opted for the former, as corrupted
ordering between stdout/stderr is just too confusing for users.

> > We generally recommend apps to use syslog() or sd-journal APIs to
> > generate their log messages and specify the log level for each message
> > explicitly, to avoid any doubts. Many programming language's logging
> > frameworks natively have support for these.
>
> The script I use can be run from the command-line and from a service.
> Because of that I have to use:
> logMsg --simple "${message}" >&2
> and:
> echo "<3>$(logMsg --simple "${message}")" >&2
>
> doable but inconvenient.
>
> > Now when I want the things send to stderr I also get the things send to
> > > stdout.
> >
> > I can't parse that.
> >
>
> To get what is send to stderr I had to do:
> journalctl -p 6 -u aptCacheUsage.service
>
> which gave beside a lot of other things the things send to stdout.
>
> Now I have two different statements I can do:
> journalctl -p 3 -u aptCacheUsage.service
>
> But it would be nice if I did not need two different statements (and the
> logic around that) for that.

Still not getting what you are trying to say here.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Append to logfile with year-month

2023-08-25 Thread Lennart Poettering
On Do, 24.08.23 09:48, Cecil Westerhof (cldwester...@gmail.com) wrote:

> In a service file I can use:
> StandardOutput=append:/var/log/root/aptCacheUsage.log
>
> but I want to use something like:
> StandardOutput=append:/var/log/root/aptCacheUsage_$(date +%%Y-%%m).log
>
> Did does not work, because this puts it in:
> /var/log/root/aptCacheUsage_$(date +%Y-%m).log
>
> Is there a way I can put it in:
> /var/log/root/aptCacheUsage_2023-08.log
>
> while it would automatically next month go into:
>/var/log/root/aptCacheUsage_2023-09.log
>
> I could of-course put it into:
> /var/log/root/aptCacheUsage.log
>
> and at the beginning of the month move it if it exists with a timed
> service, but I really would not like that kind of solution.

We do not support this. systemd supports evaluating some specifiers,
but time/date is not one of them, in particular as we resolve
specifiers at parse time of the unit only, not afterwards. Or in other
words: we'd resolve the specifiers early at boot, and that doesn't
look like what you want.

Also, for long-running services this wouldn't work anyway, as we can't
rotate files like that, because we cannot externally close the current
stdout of a process and replace it with a new file.

Hence, what you are trying to do is not supported, and is unlikely to
ever be supported for multiple reasons.

Sorry!

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Why are the priorities of stdout and stderr the same

2023-08-25 Thread Lennart Poettering
On Do, 24.08.23 16:31, Cecil Westerhof (cldwester...@gmail.com) wrote:

> Normally in a script when something is send to stdout it is seen as an
> error has occurred.
> But in systemd both get a priority of 6 (info).
> Why does stderr not get a priority of 3 (err), or at least lower as
> stdout?

stderr is a bit of a misnomer, it's not just for errors, it's also for
progress output, informational output and so on, basically everything
that is not considered the primary output contents that one would want
to propagate in pipelines.

We should not "assume the worst", hence given that the stderr stream
is typically used for all kinds of informational messages we should
not always assume it's an error, because quite often it's just
informational.

Hence, we use LOG_INFO if we have no clue simply because that's the
"best assumption".

We generally recommend apps to use syslog() or sd-journal APIs to
generate their log messages and specify the log level for each message
explicitly, to avoid any doubts. Many programming languages' logging
frameworks natively have support for these.
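
For shell scripts that only have stdout/stderr to work with, the
"<N>" line prefix convention from sd-daemon(3) is another way to attach
a level to each line. A minimal sketch, assuming the service keeps
SyslogLevelPrefix= enabled; the function name is made up, and checking
only whether $JOURNAL_STREAM is set is a simplification (a careful
check would compare its device/inode value against the actual stderr):

```shell
# Print an error message to stderr. When running under the journal
# ($JOURNAL_STREAM is set), prefix the line with "<3>" so that it is
# stored with priority 3 (err); otherwise print the plain message.
log_err() {
    if [ -n "${JOURNAL_STREAM:-}" ]; then
        printf '<3>%s\n' "$*" >&2
    else
        printf '%s\n' "$*" >&2
    fi
}

# Example:
#   log_err "apt cache usage above threshold"
```

The prefix applies per line, so multi-line messages need it on every
line.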

> Now when I want the things send to stderr I also get the things send to
> stdout.

I can't parse that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Error during SCC_DAEMON installation

2023-08-25 Thread Lennart Poettering
On Do, 24.08.23 13:28, Maber, Paul (paul.ma...@cgi.com) wrote:

> Classification: Confidential
>
> When installing the SAP Cloud Connector, I am getting the following errors.  
> The installation is being performed by the user root as instructed.
>
> :/opt/sap/scc # journalctl -xeu scc_daemon.service
> Aug 24 13:41:35  scc_daemon[5574]: scc_Daemon start failed, see 
> logfile: /opt/sap/scc/scc_daemon.log

systemd is just the messenger here. Please contact SAP for help on
this SAP product, not the systemd project.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-cryptenroll with TPM2

2023-08-23 Thread Lennart Poettering
On Di, 22.08.23 22:35, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> On Tue, Aug 22, 2023 at 8:10 PM Lennart Poettering
>  wrote:
> > On Di, 22.08.23 19:16, Aleksandar Kostadinov (akost...@redhat.com) wrote:
> <...>
> > > If attacker replaces volume with unencrypted one, and it boots without
> > > messing up the sealing PCRs, then probably attacker can query the TPM
> > > and obtain the encryption key. Despite the fact that this is not (yet)
> > > implemented in cryptenroll.
> >
> > Sure, if you allow unencrypted systems to boot in your OS then all
> > bets are off. You shouldn't do that of course.
> >
> > (in my model of mind, where automatic GPT image dissection is used the
> > image dissection policies are how this should be locked down, see
> > systemd.image-policy(7). You can confgure that via the kernel cmdline:
> > in systemd.image_policy=.
> >
> > In systemd there's the "systemd-pcrfs@.service" and
> > "systemd-pcrmachine.service" which will measure the identity of file
> > systems and of /etc/machine-id into PCR 15. (systemd-cryptsetup also
> > mesures a derivate of the volume key to PCR 15). PCR 15 is supposed to
> > be an identifier of the OS instance.
>
> Wait. I was looking at this PCR. But wouldn't it be set only after the
> volume has been unlocked? This means that before a volume is unlocked,
> it cannot protect anything? Actually it may protect in case where
> attacker replaced the volume with another encrypted volume. But not if
> attacker replaced with a plain volume.

As I said earlier: if you don't encrypt you've lost anyway. This is not a
scenario I care about in my view of the world. And frankly, it really
doesn't make much sense to try to lock down boot but not actually
encrypt the disk...

> Or is it measured with the *encrypted* volume key which would actually
> protect from volume replacement of any sort (I think) and would mostly
> solve my concern?

No, we measure the decrypted volume key (or actually, we measure the
result of an HMAC of a fixed string, keyed by the volume key, since we
don't want the key to show up in measurement logs in any useable way).

> I mean if somehow the LVM structure including the encrypted key(s) are
> measured somewhere, then such an attack should not be viable.

LVM? What's LVM got to do with anything?

> I guess I should test whether replacing the volume with non-encrypted
> will work. If it works, then there might be an issue. If it does not
> work, then sealing with PCR 15 might be what will get me going,
> because replacing with an encrypted volume will definitely modify it
> and block decrypting of the original key.

In my view of the world you have an authenticated + measured UKI that
unlocks the encrypted root fs, and simply refuses to boot if the root
fs is not encrypted with a key it can acquire somehow. This should
give you all the protection you need.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-cryptenroll with TPM2

2023-08-22 Thread Lennart Poettering
On Di, 22.08.23 19:16, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> > > I'm concerned though about an attacker replacing the encrypted root volume
> > > with a non-encrypted one. Which may result in system booting an attacker
> > > controlled environment while PCRs may be in a state that allows decryption
> > > of the original root volume.
> > >
> > > Would anything prevent the system from booting with a replaced root
> > > volume?
> >
> > Well, when you bind your disk to the TPM then this means you place a
> > TPM-encrypted key in the LUKS header. This key has to be passed to the
> > right TPM to be unlocked. This means that if an attacker just has the
> > disk, it's hard for them to acquire the decrypted key since they lack
> > the TPM. But it also means that if an attacker wants to replace the disk,
> > it's very hard to forge a key that is locked against that specific TPM.
>
> If attacker replaces volume with unencrypted one, and it boots without
> messing up the sealing PCRs, then probably attacker can query the TPM
> and obtain the encryption key. Despite the fact that this is not (yet)
> implemented in cryptenroll.

Sure, if you allow unencrypted systems to boot in your OS then all
bets are off. You shouldn't do that of course.

(In my mental model, where automatic GPT image dissection is used, the
image dissection policies are how this should be locked down; see
systemd.image-policy(7). You can configure that via the kernel command
line, with systemd.image_policy=.)

In systemd there's the "systemd-pcrfs@.service" and
"systemd-pcrmachine.service" which will measure the identity of file
systems and of /etc/machine-id into PCR 15. (systemd-cryptsetup also
measures a derivative of the volume key into PCR 15). PCR 15 is supposed to
be an identifier of the OS instance.
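
As a rough sketch of why this works as an identifier (Python, assuming
the SHA-256 bank; the event strings are hypothetical placeholders, not
what the services above actually measure): a PCR can only be extended,
never set, so its final value depends on every measurement and on
their order.

```python
import hashlib

PCR_SIZE = 32  # size of a SHA-256 PCR in bytes


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """TPM-style extend: new PCR = SHA256(old PCR || SHA256(data))."""
    return hashlib.sha256(pcr + hashlib.sha256(data).digest()).digest()


def replay(events: list[bytes]) -> bytes:
    """Compute the final PCR value, starting from the all-zero state."""
    pcr = bytes(PCR_SIZE)
    for event in events:
        pcr = pcr_extend(pcr, event)
    return pcr


# Hypothetical measurement payloads, for illustration only.
machine_id_event = b"machine-id:0123456789abcdef"
root_fs_event = b"file-system:/"
other_fs_event = b"file-system:/some/other/volume"

expected = replay([machine_id_event, root_fs_event])
swapped_fs = replay([machine_id_event, other_fs_event])
reordered = replay([root_fs_event, machine_id_event])
```

Here expected, swapped_fs and reordered all differ, so a secret sealed
against the expected value would not be released on an instance that
measures something else.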

> > It analyzes the UEFI TPM event log (which lists all measurements made
> > to PCRs) and tries to safely recognize components in it. It is then
> > supposed to use that to generate signed PCR policies, based on a
> > keypair stored on the local TPM that is itself protected by one of
> > its own signed PCR policies.
> >
> > In the long run the way I envision this we'd have two signed PCR
> > policies in place:
> >
> > 1. A vendor supplied one that covers the UKI and its resources (this
> >    already pretty much exists), i.e. PCR 11. This one is pre-computed
> >    at build time of the OS and hence can only cover resources known at
> >    that time.
> >
> > 2. A locally maintained one on the individual system, based on a local
> >    key, that covers everything inherently local that is hard to
> >    predict from the outside (and for good measure also covers the
> >    vendor supplied stuff, because why not). This would then cover PCRs
> >    0-7, 9, 11-13, 15, i.e. everything that is reasonably stable
> >    locally.
> >
> > Alas, as mentioned this is WIP, still.
>
> I didn't expect the unattended server TPM2 encryption to be such a
> muddy ground. Probably because serious use cases also involve more
> infrastructure and dedicated admins, etc.

It is certainly my intention to make this all "just work" and "default
on", even on consumer hw. Windows does it, so we should be able to do
that as well.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] systemd-cryptenroll with TPM2

2023-08-22 Thread Lennart Poettering
On Mo, 21.08.23 19:56, Aleksandar Kostadinov (akost...@redhat.com) wrote:

> Thanks, this is what I was also considering the feasibility of. And whether
> it made sense to begin with. Any idea how can this be done with systemd?
>
> In man I read:
>
> >   Note that currently when enrolling a new key of one of the five
> >   supported types listed above, it is required to first provide a
> >   passphrase, a recovery key or a FIDO2 token. It's currently not
> >   supported to unlock a device with a TPM2/PKCS#11 key in order to enroll
> >   a new TPM2/PKCS#11 key. Thus, if in future key roll-over is desired
>
> So I wonder if systemd already does that, or is it just an artificial
> limitation? Would be wonderful if it already did so.

It's just that no one has implemented this. The unlocking code paths in
cryptsetup and in cryptenroll are quite different, which makes this
non-trivial.

But patches welcome.

Generally, I am very much of the opinion that we shouldn't change the
disk whenever PCRs change. Instead we should use signed PCR policies
to accommodate "clean" PCR changes (as mentioned in the other mail
in this thread), i.e. simply sign a new PCR policy if we learn about a
new "golden" PCR state we want to permit. This is much more robust and
scales better. Moreover, it makes it easy to invalidate old golden
states, by implicitly binding things to an nvindex counter object in
the TPM at the same time. Such rollback protection is, I am sure,
crucial to guarantee the security of non-interactive systems.
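
A toy model of that logic, in Python. This is only a sketch under
loud assumptions: a real setup uses asymmetric signatures and TPM
policy commands, whereas here an HMAC stands in for the signature so
the example needs only the standard library, and all names
(SignedPolicy, may_unlock, nv_counter) are made up for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class SignedPolicy:
    pcr_value: bytes   # the "golden" PCR state this policy permits
    counter: int       # generation number baked into the policy
    signature: bytes   # HMAC stand-in for a real signature


def _payload(pcr_value: bytes, counter: int) -> bytes:
    return pcr_value + counter.to_bytes(8, "big")


def sign_policy(signing_key: bytes, pcr_value: bytes, counter: int) -> SignedPolicy:
    sig = hmac.new(signing_key, _payload(pcr_value, counter), hashlib.sha256).digest()
    return SignedPolicy(pcr_value, counter, sig)


def may_unlock(signing_key: bytes, policy: SignedPolicy,
               current_pcr: bytes, nv_counter: int) -> bool:
    """Permit unlock only for a validly signed, current, matching policy."""
    expected = hmac.new(signing_key, _payload(policy.pcr_value, policy.counter),
                        hashlib.sha256).digest()
    return (hmac.compare_digest(expected, policy.signature)
            and policy.pcr_value == current_pcr
            and policy.counter >= nv_counter)


# Hypothetical key and PCR states, for illustration only.
key = b"hypothetical-signing-key"
old_state = hashlib.sha256(b"old golden state").digest()
new_state = hashlib.sha256(b"new golden state").digest()

old_policy = sign_policy(key, old_state, counter=1)
new_policy = sign_policy(key, new_state, counter=2)
```

With nv_counter at 1 the old policy unlocks the old state. After
signing new_policy and bumping nv_counter to 2, only the new state
unlocks and the old policy is rejected, all without touching what is
stored on the disk.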

> P.S. Also another thing I was considering was that if I did this
> "extension", then I'm not sure how to then properly setup the sealing. But
> maybe with the signed PCRs support it can work as PCRs don't need to be in
> the expected state at configuration time. But also I want to do with as
> little modifications from defaults as possible. If I have to rewrite the
> whole thing, it will be hard but also I don't want to risk making mistakes
> that original scripts already avoid.

For neither the literal PCR policies nor the signed PCR policies do
the PCRs actually need to be in the expected state when enrolling.
Support for the former was recently added upstream.

Lennart

--
Lennart Poettering, Berlin

