Re: [systemd-devel] Reducing unmount/mount of partitions on soft-reboot
On Mi, 13.03.24 16:57, Aditya Gupta (adit...@linux.ibm.com) wrote:

> Hello,
>
> I tried systemd-soft-reboot on a RHEL system, and it's amazing in terms
> of its ability to do a userspace reboot, within a fraction of the time of a
> full system reboot. For example, for a Power system taking around 50
> seconds to do a normal reboot, it took around 4-5 seconds for a
> systemd-soft-reboot.
>
> I have a question on further optimisation. After soft-reboot, I notice
> much of the time is taken up by .device and .mount services. This was my
> observation based on 'systemd-analyze blame'. Please do let me know if
> I am seeing the wrong numbers, or if there's a better way to know.
>
> Is there some way to 'pass-through' these mounts ? That is, I might not
> need to unmount and remount my boot/root partitions.

Bind mount the relevant mounts from the current system into
/run/nextroot/ if you are using that.

If you are not using /run/nextroot/ then you can also define the mount
via a .mount unit (rather than letting it be auto-generated via
/etc/fstab + systemd-fstab-generator), and then set
DefaultDependencies=no in it, so that it does not get an implicit
Conflicts= dependency on umount.target.

This is briefly documented on the systemd-soft-reboot.service man page
btw.

Lennart

--
Lennart Poettering, Berlin
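For illustration, the .mount unit approach described in this reply might look roughly like the sketch below. The mount point, device path, and file system type are hypothetical, and re-adding the ordering against local-fs-pre.target and local-fs.target is an assumption about what one would typically want once the default dependencies are gone. Any /etc/fstab entry for the same mount point would have to be removed so that the generator does not produce a competing unit.

```ini
# /etc/systemd/system/data.mount  (hypothetical; the file name must match Where=)
[Unit]
Description=Data partition kept mounted across soft-reboot (sketch)
# Drop the implicit dependencies, including Conflicts=umount.target.
DefaultDependencies=no
# Assumption: re-add the ordering that the default dependencies would have provided.
After=local-fs-pre.target
Before=local-fs.target

[Mount]
What=/dev/disk/by-label/data
Where=/data
Type=ext4

[Install]
WantedBy=local-fs.target
```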
Re: [systemd-devel] How to install libudev from source?
On Do, 07.03.24 17:09, Vru Inbvi (vru.in...@gmail.com) wrote: > Hi, > > I am struggling to install libudev from source (with Ubuntu) > Can someone please explain what the correct way to do this is, or point me > to relevant/updated documentation? https://systemd.io/HACKING Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Query on sshd.socket sshd.service approaches
On Mi, 06.03.24 13:06, Arseny Maslennikov (a...@cs.msu.ru) wrote:

> > The question of course is how many SSH instances you serve every
> > minute. My educated guess is that most SSH installations have a use
> > pattern that's more on the "sporadic use" side of things. There are
> > certainly heavy use scenarios though (e.g. let's say you are github
> > and serve git via sshd).
>
> A more relevant source of problems here IMO is not the "fair use"
> pattern, but the misuse pattern.
>
> The per-connection template unit mode, unfortunately, is really unfit
> for any machine with ssh daemons exposed to the IPv4 internet: within
> several months of operation such a machine starts getting at least 3-5
> unauthed connections a second from hierarchically and geographically
> distributed sources. Those clients are probing for vulnerabilities and
> dictionary passwords, they are doomed to never be authenticated on a
> reasonable system, so this is junk traffic at the end of the day.
>
> If sshd is deployed the classic way (№1 or №3), each junk connection is
> accepted and possibly rate-limited by the sshd program itself, and the
> pid1-manager's state is unaffected. Units are only created for
> authorized connections via PAM hooks in the "session stack";
> same goes for other accounting entities and resources.
> If sshd is deployed the per-connection unit way (№2), each junk connection
> will fiddle with system manager state, IOW make the machine create and
> immediately destroy a unit: fork-exec, accounting and sandboxing setup
> costs, etc. If the instance units for junk connections are not
> automatically collected (e. g. via `CollectMode=inactive-or-failed`
> property), this leads to unlimited memory use for pid1 on an unattended
> machine (really bad), powered by external actors.

Well, whatever sshd does as ratelimiting systemd can do too afaics.
I.e. the sshd@.service definitions that we suggest and that the big
distros use all get the ExecStart=- thing right, so that an unclean
exit of sshd does not result in a pinned unit. Moreover, there's
PollLimitIntervalSec=/PollLimitBurst=, MaxConnectionsPerSource=,
MaxConnections= that ensure that any attempt to flood the socket is
reasonably contained, and the system recovers from that.

Current versions of systemd enable these settings by default, hence I
think we actually should be fine by default, even if you do not tune
these .socket parameters.

> > I'd suggest to distros to default to mode
> > 2, and alternatively support mode 3 if possible (and mode 1 if they
> > don't want to patch the support for mode 3 in)
>
> So mode 2 only really makes sense for deployments which are only ever
> accessible from intranets with little junk traffic.

What precisely do you think is missing in systemd that
PollLimitIntervalSec=/PollLimitBurst=, MaxConnectionsPerSource=,
MaxConnections= can't cover?

Lennart

--
Lennart Poettering, Berlin
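As a rough sketch of how the .socket settings named in this reply could be tuned explicitly, a drop-in along the following lines should work. The drop-in path and all numeric values are illustrative assumptions, not recommended settings, and MaxConnections=/MaxConnectionsPerSource= only have an effect on sockets using Accept=yes.

```ini
# /etc/systemd/system/sshd.socket.d/10-limits.conf  (hypothetical drop-in path)
[Socket]
# Upper bound on concurrently running per-connection instances.
MaxConnections=64
# Upper bound on concurrent connections from a single source address.
MaxConnectionsPerSource=8
# Rate limit on how often poll events on the listening socket are processed;
# as I understand the documentation, exceeding it makes systemd stop watching
# the socket for a while instead of failing the unit.
PollLimitIntervalSec=2s
PollLimitBurst=150
```

Setting the limits in a drop-in rather than in a copy of the unit keeps the distro-provided sshd.socket intact across package updates.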
Re: [systemd-devel] Query on sshd.socket sshd.service approaches
On Mi, 06.03.24 14:44, Shreenidhi Shedi (shreenidhi.sh...@broadcom.com) wrote:

> > Lennart Poettering, Berlin
>
> Thanks a lot for the responses Andrei, Poettering.
> We took it from blfs in PhotonOS.
> https://www.linuxfromscratch.org/blfs/view/11.3-systemd/introduction/systemd-units.html
> We need to do some more work on these unit files.

But that tarball actually contains a correct sshd -i line that includes
the "-" that causes the return value to be ignored, as it should.

Hence if your distro didn't do this even though it imported this from
LFS, then it's your distro that broke that...

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Query on sshd.socket sshd.service approaches
On Mi, 06.03.24 11:11, Shreenidhi Shedi (shreenidhi.sh...@broadcom.com) wrote:

> Hi All,
>
> What is the rationale behind using sshd.socket other than not keeping sshd
> daemon running always and reducing memory consumption?

Note that there are two distinct modes to running sshd via socket
activation: the per-connection mode (using sshd's native inetd mode),
where there's a separate instance forked off by systemd for each
connection, and a mode where systemd just binds the socket, but it's
served by a single instance. The latter is only supported via an
out-of-tree patch afaik though, which at least debian/ubuntu ship:

https://salsa.debian.org/ssh-team/openssh/-/commit/7fa10262be3c7d9fd2fca9c9710ac4ef3f788b08

Unless you have a gazillion of connections coming in every second I'd
probably just use the per-connection inetd mode, simply because it's
supported upstream. Would be great of course if openssh would just add
support for the single-instance mode in upstream too, but as I
understand ssh upstream is a bit special, and doesn't want to play
ball on this.

To summarize the benefits of each mode:

1. Traditional mode (i.e. no socket activation)
   + connections are served immediately, minimal latency during
     connection setup
   - takes up resources all the time, even if not used

2. Per-connection socket activation mode
   + takes up almost no resources when not used
   + zero state shared between connections
   + robust updates: socket stays connectible throughout updates
   + robust towards failures in sshd: the bad instance dies, but sshd
     stays connectible in general
   + resource accounting/enforcement separate for each connection
   - slightly bigger latency for each connection coming in
   - slightly more resources being used if many connections are
     established in parallel, since each will get a whole sshd
     instance of its own.

3. Single-instance socket activation mode
   + takes up almost no resources when not used
   + robust updates: socket stays connectible throughout updates

> With sshd.socket, systemd does a fork/exec on each connection which is
> expensive and with the sshd.service approach server will just connect with
> the client which is less expensive and faster compared to
> sshd.socket.

The question of course is how many SSH instances you serve every
minute. My educated guess is that most SSH installations have a use
pattern that's more on the "sporadic use" side of things. There are
certainly heavy use scenarios though (e.g. let's say you are github
and serve git via sshd).

I'd suggest to distros to default to mode 2, and alternatively support
mode 3 if possible (and mode 1 if they don't want to patch the support
for mode 3 in)

> And if there are issues in unit files like in
> https://github.com/systemd/systemd/issues/29897 it will make the system
> unusable.

Did any distro ship a unit file like that? That was clearly a buggy
(local?) unit file, I am not aware of any big distro shipping such a
unit file.

Lennart

--
Lennart Poettering, Berlin
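To make mode 2 from the summary above concrete, a per-connection setup usually consists of a socket unit with Accept=yes plus a template service, roughly as sketched here. The descriptions and the path to the sshd binary are assumptions; distro-shipped units will differ in detail.

```ini
# --- sshd.socket (sketch) ---
[Unit]
Description=OpenSSH server socket (per-connection mode)

[Socket]
ListenStream=22
# Spawn one sshd@.service instance per accepted connection.
Accept=yes

[Install]
WantedBy=sockets.target

# --- sshd@.service (sketch) ---
[Unit]
Description=OpenSSH per-connection server daemon

[Service]
# The leading "-" makes systemd ignore a non-zero exit status, so a failed
# or aborted connection does not leave a failed unit behind.
# Assumption: sshd is installed at /usr/sbin/sshd.
ExecStart=-/usr/sbin/sshd -i
# In inetd mode sshd talks to the client via stdin/stdout.
StandardInput=socket
```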
Re: [systemd-devel] Can I provide separate enabling for dbus-activation and "normal" start ?
On Do, 22.02.24 17:09, Max Gautier (m...@max.gautier.name) wrote:

> Is it possible when writing a dbus-activable service to provide two
> separate and independent ways to enable it ?
>
> The D-Bus service file would for instance be:
> [D-BUS Service]
> Name=org.freedesktop.Notifications
> Exec=notification-daemon
> SystemdService=dbus-org.freedesktop.Notifications.service
>
> The systemd service:
> [Unit]
> PartOf=graphical-session.target
> After=graphical-session.target
>
> [Service]
> Type=dbus
> BusName=org.freedesktop.Notifications
> ExecStart=notification-daemon
>
> [Install]
> Alias=dbus-org.freedesktop.Notifications.service
> WantedBy=graphical-session.target
>
>
> With that systemd service file, `systemctl enable` would cause the
> service to be started by graphical-session.target and by
> dbus-activation; but it is possible to have two separate enable
> commands, one which would enable the dbus activation, one the
> graphical-session start ?
>
> I suppose I should have two separate unit files but I'm not completely
> sure how to do that without copying the whole file (i.e, is there some
> Install/Unit relation I can use for that ?)

No, in systemd there's only one "systemctl enable" and it applies the
[Install] section of the unit file, and that's really all there is.

You can probably add two unit files and use Alias= so that they pick a
common name as alias. But one unit cannot have two distinct [Install]
sections, if that's what you are looking for.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)
On Di, 20.02.24 10:24, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> Thanks, I will check this. It sounds like optee needs a similar dependency
> generator.
>
> I wonder how many kernel subsystems/drivers which need userspace daemons
> would need systemd side dependency generators. Is it only the ones inside
> initramfs and/or pre-rootfs mount which need special handling?

Well, systemd to a large part is about getting deps in order,
i.e. start things in the right order but still as parallelized as
possible to make sure we can boot properly, fast. For regular
(i.e. late boot) services things are easier, since we can hide various
deps via socket activation and services typically just have fewer
deps, but during early boot things always require careful
consideration of what you need to schedule when. That's hardly
surprising, is it?

TPM stuff in particular is stuff that we want to make use of super
early, because it's inherently part of the boot process to measure
progress and resources we are using. It's what "Measured Boot" after
all means. And that means you need to know what you do, and can't
really escape that.

> In the end the logic is quite straight forward. If kernel side support is
> there, then a daemon needs to be started before user service start, but
> boot should continue without if kernel support is not detected.

systemd generators are our way to allow dynamic extension of the
systemd unit dependency graph. It's the fact that you want things
dynamic (i.e. responsive to whether your system has a specific kind of
tpm device/secure enclave) that means you have to do it with a
generator.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)
On Mo, 19.02.24 10:36, Mikko Rapeli (mikko.rap...@linaro.org) wrote: > > After=dev-tpmrm0.device tee-supplicant@teepriv0.service > > Wants=dev-tpmrm0.device tee-supplicant@teepriv0.service > > I think my problems come from: > > After=tee-supplicant@teepriv0.service > Wants=tee-supplicant@teepriv0.service > > Basically tee-supplicant should only be started if /dev/teepriv* device node > is available. Then in my case with fTPM devices, all TPM using and encrypted > rootfs creating services need to depend on the service which starts > tee-supplicant > but only if /dev/teepriv0 exists. If teepriv0 doesn't exist, then > tee-supplicant > should not be started and the dependencies to it should not exist > either. Is /dev/teepriv* guaranteed to be available when userspace is invoked? or is it something that itself requires some kmod loading to show up, i.e. that "udevadm trigger" causes to load? > How should this dependency be expressed in systemd services? > > Can tee-supplicant@.service include: > > Before=systemd-pcrphase-initrd.service systemd-pcrphase.service > systemd-pcrmachine.service > WantedBy=systemd-pcrphase-initrd.service systemd-pcrphase.service > systemd-pcrmachine.service > > In my testing this does not seem to work inside initramfs. > > If systemd-pcrphase-initrd.service systemd-pcrphase.service and > systemd-pcrmachine.service > service have After= and Wants= to tee-supplicant@teepriv0.service then things > work, > except on boards which have no optee and no /dev/teepriv0 where > tee-supplicant seems > be started and fails due to missing optee which breaks the initramfs boot. For your usecase the new tpm2.target available in git main is what you really should focus on: all TPM using services should order themselves after that. All stuff needed to make a TPM device appear should be placed before that. 
The systemd-tpm2-generator that now exists in git main analyzes the
uefi/acpi firmware situation and automatically adds a
dev-tpmrm0.device dependency on tpm2.target if it comes to the
conclusion that such a device will show up.

This generator is not going to cover your specific case, but I think
it would be a good blueprint for you: i.e. write a generator that
checks if /dev/teepriv* exists. If not, just exit. If yes, generate
the required deps to pull in tee-supplicant@.service, and add the
dev-tpmrm0.device dep just like systemd-tpm2-generator does.

Lennart

--
Lennart Poettering, Berlin
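The static half of this advice, ordering around tpm2.target, could be expressed with drop-ins roughly like the following. The consumer unit name and the drop-in paths are hypothetical, and the sketch assumes a systemd version that ships tpm2.target. The conditional part (only pulling in tee-supplicant@.service when /dev/teepriv* exists) would still have to come from a generator as described above.

```ini
# Hypothetical drop-in for a service that uses the TPM:
# /etc/systemd/system/my-tpm-consumer.service.d/10-tpm2.conf
[Unit]
# Start only after everything needed to make the TPM usable is up.
After=tpm2.target

# Hypothetical drop-in for a service needed before the TPM is usable:
# /etc/systemd/system/tee-supplicant@.service.d/10-tpm2.conf
[Unit]
# Make sure the supplicant is running before TPM consumers are started.
Before=tpm2.target
```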
Re: [systemd-devel] Issues supporting systems with and without TPM and firmware TPM (was Re: Handle device node timeout?)
On Fr, 16.02.24 11:28, Mikko Rapeli (mikko.rap...@linaro.org) wrote:

> Support for fTPM devices is problematic. First, the kernel support must be
> modules but loading needs to be specially handled after starting
> tee-supplicant. For normal boot udev handles optee detection and triggers
> tee-supplicant@teepriv0.service startup which unloads tpm_ftpm_tee kernel
> module, starts tee-supplicant and then loads the kernel module again. After
> this RPMB works. To do the same in initramfs, I added Wants: and After:
> dependencies from systemd-repart.service, systemd-cryptsetup@.service,
> systemd-pcrmachine.service and systemd-pcrphase-initrd.service:

Kernel module unloading is not supposed to happen in clean
codepaths. It's a debug/development feature, it's not safe to do as
part of regular boot.

But why do you need to unload a kernel module at all? That smells...

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Handle device node timeout?
On Di, 16.01.24 16:06, Mikko Rapeli (mikko.rap...@linaro.org) wrote: > Hi, > > I have services which depend on a specific device node. How can I run > some recovery actions when the default 90s timeout for finding this > device is hit? > > OnFailure= doesn't work as the service is not even started. > > Specifically the case is about supporting TPM2 encrypted rootfs but falling > back to plain-text rootfs generation if there is no TPM2 device. Currently > my initramfs works with TPM2 but without it fails with: In git main there's new infra to deal with this case: https://github.com/systemd/systemd/pull/30194 That should hopefully solve this systematically and generically. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] logind: Activating session/opening seat fails in systemd v254
On Do, 15.02.24 22:16, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hi everyone, > > I am working on a kiosk-type device which is supposed to start a > weston instance upon boot. > Our images were previously based on Debian 12 and Fedora 38, now we > are working on unifying them. Between the two old image variants the > systemd units were mostly identical, however, on Fedora 39 with > systemd 254 they no longer work. Weston/libseat now fails with the > message: "Could not activate session: Permission denied". (Also see > the logind logs at the end). Neither Weston nor libseat (whatever that is) are a systemd thing. Please contact the relevant projects for help? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Scan all USB devices from Linux service
On Mi, 14.02.24 20:24, Muni Sekhar (munisekhar...@gmail.com) wrote:

> HI all,
>
> USB devices can have multiple interfaces (functional units) that serve
> different purposes (e.g., data transfer, control, audio, etc.).
>
> Each interface can have an associated string descriptor (referred to
> as iInterface). The string descriptor provides a human-readable name
> or description for the interface.
>
> From user space service utility, How to scan all the USB devices
> connected to the system and read each interface string
> descriptor(iInterface) and check whether it matches "Particular
> String" or not.

You can use sd-device.h: allocate an enumerator via
sd_device_enumerator_new(), then apply some filter via
sd_device_enumerator_add_match_sysattr(), and then enumerate through it
via
sd_device_enumerator_get_device_first()/sd_device_enumerator_get_device_next().

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Issue with systemd-logind
On Mi, 14.02.24 15:03, Akshaya Maran (akshayamara...@gmail.com) wrote: > Hi, > > I am trying to run weston11.0.1 using systemd logind launcher but got this > error > " logind: failed to get session seat > logind: cannot setup systemd-logind helper error:" This looks like an error message from some weston thing. Please ask that community for help. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] ConditionNeedsUpdate, read-only /usr, and sysext
On Mi, 07.02.24 20:42, Valentin David (m...@valentindavid.com) wrote: > Hello everybody, > > The behavior of ConditionNeedsUpdate is that if /etc/.updated is > older than /usr/, then it is true. > > I have some issues with this. But maybe I do not use it the right > way. > > First, when using a read-only /usr partition (updated through > sysupdate), the time of /usr is of the build of that filesystem. In > the case of GNOME OS, to ensure reproducibility bit by bit, we set > all times to some time in 2011. So that does not work for us. Hmm, I wonder if the os-release file in /usr/ should optionally have a timestamp field which could be used. That could be directly initialized from $SOURCE_DATE_EPOCH at build time (maybe the field should even be named like that). I think that would make sense, no? > But now let's say we work-around that, and we make our system take a > date that is reproducible, let's say the git commit of our > metadata. Then we have a second issue. > > Because of systemd-sysext, it might be that /usr is not anymore the > time of the /usr filesystem, but the time of a directory created on > the fly by systemd-sysext (or maybe it keeps the time from the / > fileystem, I do not know, but for sure the time stamp is from when > systemd-sysext was started). If systemd-update-done happens after > systemd-sysext (and it effectively does on 254), then the date of > /etc/.updated will become the time when systemd-sysext started. Uh. That'd be a bug. Can you file an issue about this? > Let's imagine that I do not boot that machine often. My system is > booting a new version. And there is already another new version > available on the sysupdate server. My system will download a build > of /usr that is likely to be older than the boot time. So next > reboot, the condition will be false, even though I did have an > update. And it will be false until I download a version that was > built after the boot time of my last successful update. 
>
> So my question is, is there plan to replace time stamp comparison
> for ConditionNeedsUpdate with something that works better with
> sysupdate and sysext? Maybe copying IMAGE_VERSION from
> /usr/lib/os-release into /etc/.updated for example?

Yeah, we should fix this. I have so far never thought about the
mixture of sysext and ConditionNeedsUpdate=. This is uncharted
territory. But I think we can fix this. But please open issues about
this.

Lennart

--
Lennart Poettering, Berlin
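For readers unfamiliar with the mechanism discussed in this thread, a unit gated on ConditionNeedsUpdate= typically looks something like the sketch below, modelled on the pattern used by units such as ldconfig.service. The unit name and the ExecStart= path are hypothetical. Ordering before systemd-update-done.service matters because that service refreshes the /etc/.updated timestamp that the condition compares against /usr/.

```ini
# /etc/systemd/system/rebuild-cache.service  (hypothetical unit)
[Unit]
Description=Rebuild a cache in /etc after /usr was updated (sketch)
DefaultDependencies=no
Conflicts=shutdown.target
After=local-fs.target
# Must run before systemd-update-done.service touches /etc/.updated.
Before=sysinit.target shutdown.target systemd-update-done.service
# Only run if /etc/.updated is older than /usr/.
ConditionNeedsUpdate=/etc

[Service]
Type=oneshot
RemainAfterExit=yes
# Hypothetical helper script.
ExecStart=/usr/local/bin/rebuild-cache

[Install]
WantedBy=sysinit.target
```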
Re: [systemd-devel] What creates a new machine-id ?
On Do, 08.02.24 09:35, Agrain Patrick (patrick.agr...@al-enterprise.com) wrote:

> Hello,
>
> Our embedded system is based on a Rocky Linux 8 distribution which embeds
> systemd-239.
>
> At first bootup, a machine-id is created and remains persistent over the
> following reboots.
> System upgrade sometimes creates a new machine-id, sometimes not.
> By 'system upgrade', I mean either new linux kernel or upgraded Rocky
> packages or both.
>
> Could you precise me what event(s) in the previous upgrade cases
> trigger a new machine-id ?

See:

https://www.freedesktop.org/software/systemd/man/latest/machine-id.html#Initialization

Or in other words: the machine ID is supposed to be persisted in
/etc/. If your upgrade procedure somehow causes the machine ID to be
invalidated, then we'll assign a new one though. We basically make
sure that whatever happens, on boot we initialize it.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Detecting Systemd crash
On Sa, 03.02.24 16:55, Álvaro Cebrián Juan (acebrianj...@gmail.com) wrote: > Great question! > > I am very interested in detecting systemd crashes too since I have > experienced them recently and have been asked to come up with a solution to > react when a PID1 crash happens. > In fact, in my recent experiences, a journald crash was enough to render > the system into an unreliable/degraded state in which some top-level > applications worked while others didn't. > > So adding to David's 1st question, I need to detect systemd and journald > crashes and then trigger a `systemctl reboot --force --force` > command As mentioned elsewhere in this thread just use RuntimeWatchdogSec= in systemd-system.conf(5) Lennart -- Lennart Poettering, Berlin
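A minimal sketch of enabling the setting mentioned in this reply, assuming the machine has a hardware watchdog device the kernel exposes; the drop-in file name and the 30s timeout are illustrative choices only:

```ini
# /etc/systemd/system.conf.d/10-watchdog.conf  (hypothetical drop-in name)
[Manager]
# PID 1 pings the hardware watchdog regularly; if PID 1 stops doing so
# (for example because it crashed or froze), the hardware resets the machine
# once this timeout elapses.
RuntimeWatchdogSec=30s
```

The manager needs to re-read its configuration (for instance via a reboot or "systemctl daemon-reexec") before the setting takes effect.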
Re: [systemd-devel] Detecting Systemd crash
On Mo, 05.02.24 13:54, Lennart Poettering (lenn...@poettering.net) wrote: > you can just use the usual hw watchdog. If pid1 dies it will not ping > the hw watchdog, and thus a reset is triggered automatically. In fact > we actually configure the hw watchdog by default these days on hw that > has it (which are most PCs). Actually, we don't really, I need to correct myself. We probably should though, dunno. See RuntimeWatchdogSec= in systemd-system.conf(5) > > > 2: How do I get Systemd to freeze to test such program? I mean, if I kill > > Systemd, the kernel would crash so I have to somehow tell Systemd to freeze? > > Not really, the kernel blocks SIGSTOP for PID1. > > Lennart > > -- > Lennart Poettering, Berlin Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Detecting Systemd crash
On So, 04.02.24 00:06, David Timber (d...@dev.snart.me) wrote: > Systemd crashed on me the other day. I was writing up some Systemd units and > testing them out by daemon-reload every time I wanted to test them out. Not > the best way to go on about, I know. My bad abusing Systemd to the point of > crashing. Perhaps it was just a bit flip that caused this. > >systemd[2368]: Assertion 'path_is_absolute(p)' failed at >src/basic/chase.c:628, function chase(). Aborting. >systemd[1]: Assertion 'path_is_absolute(p)' failed at >src/basic/chase.c:628, function chase(). Aborting. >systemd[1]: Caught from our own process. >systemd-coredump[32497]: Due to PID 1 having crashed coredump >collection will now be turned off. >systemd-coredump[32497]: [] Process 32496 (systemd) of user 0 >dumped core. >systemd[1]: Caught , dumped core as pid 32496. >systemd[1]: Freezing execution. > >... > >systemd-journald[871]: Failed to send stream file descriptor to >service manager: Transport endpoint is not connected > > I didn't even bother trying producing stack trace. I can get on that if > anyone wants it. My machine started doing some weird things like > Firefox not If this is a current systemd version (v255), please generate a stack trace and submit it as github issue to us, we'll look into it. If it's older, please report to your distro first. > being able to do Ajax properly whilst being able to go to a new page, > Chromium not being able to create a new tab whilst all the text editors > worked just fine, all the systemctl commands timing out. So basically, I was > using Linux without fork(). Anyway. > Well, I think any software can crash for any reason whatsoever. The > problem Yeah, an assert like the above is an error we need to fix in systemd. > with Systemd I realised from this incident is that I had no way of knowing > that Systemd had crashed until I opened up the journal and kernel logs and > saw that Systemd had crashed some time ago. 
In this particular incident, > Systemd caught the signal and decided to just freeze. No idea why you'd want > that because if it had just crashed, the kernel would have just panicked and > I would have realised something went wrong. > > 1: So I decided that I need a some sort of "watchdog" that warns me when > something like this happens. Using dbus to poll the status of the Systemd > process, it could be a GUI app running under a seat, just a daemon that > writes a warning message using `wall` or just send mail using a primed up > MUA process. I wonder if someone already had the same idea and went on to > make one. you can just use the usual hw watchdog. If pid1 dies it will not ping the hw watchdog, and thus a reset is triggered automatically. In fact we actually configure the hw watchdog by default these days on hw that has it (which are most PCs). > 2: How do I get Systemd to freeze to test such program? I mean, if I kill > Systemd, the kernel would crash so I have to somehow tell Systemd to freeze? Not really, the kernel blocks SIGSTOP for PID1. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd-pcrlock Failed to submit super PCR policy
On Mo, 05.02.24 09:24, Dominick Grift (dominick.gr...@defensec.nl) wrote:

Please run "SYSTEMD_LOG_LEVEL=debug systemd-pcrlock make-policy" from
the command line, then file a github issue about this, and paste the
output there.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Systemd units complains about cgroup with 5.15.x kernel
On Do, 01.02.24 16:30, Thierry Bultel (thierry.bul...@linatsea.fr) wrote:

> Hi,
>
> I am using systemd v255,
> and currently using a kernel vendor branch :
>
> g...@github.com:varigit/linux-imx.git
> lf-5.15.y_var01
> imx_v7_defconfig
>
> I had no issue with the older 5.4 kernel.
>
> I have verified that the kernel has the following options:
>
> CONFIG_DEVTMPFS=y
> CONFIG_CGROUPS=y
> CONFIG_INOTIFY_USER=y
> CONFIG_SIGNALFD=y
> CONFIG_TIMERFD=y
> CONFIG_EPOLL=y
> CONFIG_UNIX=y
> CONFIG_SYSFS=y
> CONFIG_PROC_FS=y
> CONFIG_FHANDLE=y
>
> CONFIG_NET_NS=y
>
> CONFIG_SYSFS_DEPRECATED is not set
>
> CONFIG_AUTOFS_FS=y
> CONFIG_AUTOFS4_FS=y
> CONFIG_TMPFS_POSIX_ACL=y
> CONFIG_TMPFS_XATTR=y
>
> --->
>
> systemd is failing to start some units:
>
> systemd[1]: wpa_supplicant.service: Failed to create cgroup
> /system.slice/wpa_supplicant.service: No such file or directory
> and also;
> (agetty)[217]: serial-getty@ttymxc0.service: Failed to attach to cgroup
> /system.slice/system-serial\x2dgetty.slice/serial-getty@ttymxc0.service: No
> medium found
>
> ... and I do not have a serial console.
>
> I am currently digging into systemd code to find out what is possibly wrong
> .. but if anyone gets a clue, I would appreciate !

Educated guess: you have no cgroup v2 or so?

It would make sense to provide logs, or to use strace to check what
precisely fails. Or ask your distro for help.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Delaying VM startup until block devices are available
On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:

> We have various VMs that are backed by luks encrypted LVs. At boot the volumes
> are decrypted by clevis. The problem we are seeing at the moment is that the
> VMs are started before the block devices are decrypted. Our current
> solution is:

We generally wait for all devices listed in /etc/crypttab, unless you
set noauto or nofail.

>
> # cat /etc/systemd/system/virtqemud.service.d/override.conf
> [Unit]
> After=blockdev@dev-mapper-luks\x2dbackup.target
> blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>
> Where we list each of the volumes to be decrypted as blocking the virtqemud
> service.
>
> Does anyone have any better alternatives? My main issue it that it feels
> somewhere in between fine-grained and coarse-grained control.
>
> Ideally I think one would be able to have each individual VM startup
> automatically delayed until the devices each used became available, but I
> don't see how to do this.

I am not sure how libvirt works, but if it runs every VM in a systemd
unit, then you could just order the device before that unit, or the
unit after the device. Really depends on how libvirt splits things up.

> Alternatively it seems like one should be able to delay all VM startup until
> all volumes in /etc/crypttab were unlocked, rather than having to specify each
> one. But I don't see a target for that.

This is default behaviour. Anything listed in /etc/crypttab is ordered
before cryptsetup.target, which is ordered before sysinit.target,
which is ordered before basic.target, which is ordered before regular
services.

Lennart

--
Lennart Poettering, Berlin
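If each VM did run in its own service unit, which is an assumption here since the reply itself leaves open how libvirt splits things up, ordering that unit after its backing device might look roughly like this. The unit name vm-01.service is hypothetical; the device unit name is the escaped form of /dev/mapper/luks-vm-01-disk0 and can be derived with "systemd-escape -p --suffix=device".

```ini
# /etc/systemd/system/vm-01.service.d/10-wait-for-disk.conf  (hypothetical unit name)
[Unit]
# Pull in the decrypted device and fail the VM start if it never shows up.
Requires=dev-mapper-luks\x2dvm\x2d01\x2ddisk0.device
# Do not start the VM before the device has appeared.
After=dev-mapper-luks\x2dvm\x2d01\x2ddisk0.device
```

This gives per-VM granularity, at the cost of one drop-in per VM; the cryptsetup.target ordering described above covers the coarse-grained case without any extra configuration.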
Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services
On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:

> Hi.
>
> I sent below mail some week ago, Barry's reply left me unsure as to
> whether this would be a bug or not. I still tend do assume that I'm
> "doing something wrong".

This is currently not supported. The filters are communicated by the
service manager to journald via xattrs on the cgroups, and journald
will only consider those for cgroups owned by root, i.e. not on
cgroups delegated to unpriv users, as is done for systemd --user
instances.

Interpreting arbitrary regexes configured by unpriv code in priv code
comes at some risk, because afair constructing them can take O(2^n)
time, i.e. a rogue regex could make us consume unbounded time on
processing journal messages.

Hence, I wouldn't hold your breath. Unless someone figures out a smart
way to deal with this it's unlikely to be supported.

We should document this however I guess. Hence if you file an issue
that would be more than welcome, so that we can keep track of this.

Lennart

--
Lennart Poettering, Berlin
Re: [systemd-devel] Permanently remove services
On Do, 18.01.24 23:40, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > > They are turning up as failed units, so they are being run, > > > even if I don't have any TPM module. Also, I have a notifier in > > > my waybar telling me of failed services and I don't want to see > > > them there. > > > > Can you provide logs about this? The goal is definitely to make these > > NOPs on TPM-less systems. I am a bit puzzled that the conditioning > > they come with is not sufficient. We might need to tweak something > > there then. > > > > The idea is that the system does TPM setup on systems that have a tpm > > and on systems lacking that silently just skips all these so that > > everything always works fully automatically and robustly without any > > ugly error output. > > > > hence, any chance you can provide logs about this? and what kind of > > system is this? i.e. does it really lack a tpm? > > In the past I have seen errors on systems which do not have > libtss2/tpm2-tss installed though I am not sure if those should be > silenced. After all, the unit being enabled means that one wants to > use it if possible - and if the libraries are missing that should be > noticeable to the user instead of a silent fail. No, the libs are installed, that's what the "systemd-creds has-tpm2" output shows. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Permanently remove services
On Do, 18.01.24 22:53, Morten Bo Johansen (morte...@hotmail.com) wrote: > ~/ % systemd-creds has-tpm2 > partial > +firmware > -driver > +system > +subsystem > +libraries OK, so this indicates that your system has TPM support on all levels with a single exception: you lack an actual Linux driver for your specific hw. And that puzzles me, because to my knowledge Linux should at least support all relevant TPM2 interfaces just fine. This suggests that you haven't got the right modules installed. I don't know Arch, but is there possibly some extra package you have to install to get more drivers? TPM2 drivers are super basic stuff, so it sounds really weird to me to split this out. It's a condition this stuff indeed is not prepared for though: that everything is set up properly, from firmware to kernel to userspace, but the driver is not actually available. > The output from journalctl --unit systemd-tpm2-setup-early.service: > >-- Boot b3fca98d73f6441590174a72ac0d27fa -- >jan 18 18:13:02 gatsby systemd-tpm2-setup[329]: Failed to create TPM2 > context: State not recoverable >jan 18 18:13:02 gatsby systemd-tpm2-setup[329]: > ERROR:tcti:src/tss2-tcti/tcti-device.c:451:Tss2_Tcti_Device_Init() Failed to > open specified TCTI device file /dev/tpmrm0: No such file or direc> >jan 18 18:13:03 gatsby systemd[1]: systemd-tpm2-setup-early.service: Main > process exited, code=exited, status=1/FAILURE >jan 18 18:13:03 gatsby systemd[1]: systemd-tpm2-setup-early.service: > Failed with result 'exit-code'. >jan 18 18:13:03 gatsby systemd[1]: Failed to start TPM2 SRK Setup (Early). > > There is a /dev/tpm0 file but not a /dev/tpmrm0 file Oh, interesting. Is it possible that your system has only a TPM 1.2 device? (Maybe your BIOS allows switching between TPM 2.0 and 1.2 modes.) It could be that we simply misdetect the TPM 1.2 case; I admittedly never tested things on such a system. How old is that PC? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Permanently remove services
On Do, 18.01.24 22:26, Morten Bo Johansen (morte...@hotmail.com) wrote: > On 2024-01-18 Lennart Poettering wrote: > > > hence, any chance you can provide logs about this? and what kind of > > system is this? i.e. does it really lack a tpm? > > I shall try to accommodate you. How do I get the log? > > The command "systemctl --plain --no-legend list-units --state=failed" > does not provide enough info. ideally boot with "systemd.log_level=debug" on the kernel cmdline, and then paste "journalctl -b" somewhere. The full output of "systemd-creds has-tpm2" would be good too. > I have no external TPM module installed and I don't think my > rather old cpu, "Intel(R) Core(TM) i5-4570T CPU @ 2.90GHz", has > any on-board TPM2 capablility? That sounds fairly recent, so I would assume that your machine has a TPM. Which OS is this? Is it possible that your kernel has TPM2 support enabled, but for some reason the driver for your hw is not available (for example not included in the initrd)? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Permanently remove services
On Do, 18.01.24 19:43, Morten Bo Johansen (morte...@hotmail.com) wrote: > On 2024-01-18 Andy Pieters wrote: > > > Not being funny, but why care? They have got a conditional check in them > > and will only run when it makes sense. > > So these units will do nothing and won't delay your boot or take up > > resources > > They are turning up as failed units, so they are being run, > even if I don't have any TPM module. Also, I have a notifier in > my waybar telling me of failed services and I don't want to see > them there. Can you provide logs about this? The goal is definitely to make these NOPs on TPM-less systems. I am a bit puzzled that the conditioning they come with is not sufficient. We might need to tweak something there then. The idea is that the system does TPM setup on systems that have a tpm and on systems lacking that silently just skips all these so that everything always works fully automatically and robustly without any ugly error output. hence, any chance you can provide logs about this? and what kind of system is this? i.e. does it really lack a tpm? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Activation environment(s)?
On Fr, 12.01.24 18:16, Vladimir Kudrya (vladimir-...@yandex.ru) wrote: > On 08/01/2024 22.26, Simon McVittie wrote: > > It is not possible to unset a variable in the dbus-daemon's activation > > environment, or with `dbus-update-activation-environment`: that's an > > API limitation in the org.freedesktop.DBus interface. I've thought about > > adding an UnsetAndSetActivationEnvironment() that could do this. > > > > It *is* possible to unset a variable in the `systemd --user` > > activation environment, with `systemctl --user unset-environment` or > > the UnsetEnvironment() and UnsetAndSetEnvironment() D-Bus methods on the > > systemd manager. If your distribution is using dbus-broker rather than > > dbus-daemon, and if Mantas was correct to say that dbus-broker does not > > have its own separate activation environment, then that should be enough > > to affect all D-Bus session services. It will also affect all other > > systemd user services. > > Thank you. I now recommend dbus-broker in my session manager's readme > (https://github.com/Vladimir-csp/uwsm), and management of dbus activation > environmentis now conditional on dbus unit true name not being > dbus-broker.service. > > BTW, the whole reason I even decided to interact with dbus is rather > aggressive session termination by systemd. It seems to send signals not only > to existing processes in the session, but even to new ones which were > spawned in response to those signals. This makes it impossible to fork a > systemctl process to stop related user units. > > I solved this by interacting with dbus without spawning new processes, but, > just for info, is there a proper way to fork something that survives for a > bit in a session that is being terminated? With simple tools like `trap > 'something' TERM HUP` in a shell? Or maybe there is some other more native > way to bind a user level unit to a particular session scope? 
When the goal is to shut down a service/session, then we intend to give guarantees that the shutdown time is bounded: we first send SIGTERM, and start a timeout. If by that timeout there are still processes left we SIGKILL to put an end to things. If we'd somehow distinguish new/old processes then we couldn't put the boundary on the shutdown process... So no, this does not exist. You can fork if you want, but it won't add time to the time-out. Lennart -- Lennart Poettering, Berlin
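The bounded-shutdown pattern described above (SIGTERM, then a fixed timeout, then SIGKILL) can be sketched roughly as follows. This is only an illustration for a single child process, not how the service manager is implemented: systemd signals every process in the unit's cgroup, and the helper names `stop_with_deadline` and `spawn_stubborn_child` are made up for this example.

```python
import signal
import subprocess
import sys


def stop_with_deadline(proc: subprocess.Popen, timeout_sec: float) -> int:
    """Send SIGTERM, wait at most timeout_sec, then SIGKILL whatever is left.

    Returns the exit status; a negative value -N means "killed by signal N".
    """
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        # The deadline is fixed: nothing the process does can extend it.
        proc.kill()
        return proc.wait()


def spawn_stubborn_child() -> subprocess.Popen:
    """Hypothetical test helper: start a child that ignores SIGTERM."""
    code = (
        "import signal, time\n"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
        "print('ready', flush=True)\n"
        "time.sleep(30)\n"
    )
    child = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    child.stdout.readline()  # block until the SIGTERM handler is installed
    return child
```

A process that ignores SIGTERM (or forks helpers in response to it) still ends when the deadline passes, which is the point made above: forking is possible, but it does not buy extra time.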
Re: [systemd-devel] Potential systemd CoredumpFilter sandboxing issue
On Mo, 08.01.24 04:04, daechir (daec...@protonmail.com) wrote: > Hello again, > Thanks for fixing the utmp build issue from Nov 2023. I lost the email and > couldn't figure out how to write to it. > > I found another issue that seems to be a bit more complicated. I'll try to > describe it as best I can. > > When booting with the kernel parameter coredump_filter=0x0, all > processes should read coredump_filter (at /proc/*/coredump_filter) > as , or private-anonymous. This behavior works as > intended. However, when specifying this kernel parameter, and also > setting the systemd sandboxing option > CoredumpFilter=private-anonymous, some services still tend to ignore > or overwrite this value. I have found with v255 that > /usr/lib/systemd/systemd --user is one such example, or > user@.service which sets its /proc/*/coredump_filter to 0001 > instead. As per kernel docs the kernel command line option only sets the *default*, i.e. userspace can override it. So the behaviour works as intended? Quoting kernel docs: coredump_filter= [KNL] Change the default value for /proc//coredump_filter. > Am I wrong in understanding that private-anonymous usually maps to ? > Also, wouldn't 0001 show something like coredump_filter=0x01 or > CoredumpFilter=shared-anonymous? I cannot parse this. Lennart -- Lennart Poettering, Berlin
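For reference, the hexadecimal value in /proc/<pid>/coredump_filter is a bit mask, and it can be decoded with a few lines of Python. This is a sketch under two assumptions: the bit positions follow the kernel's core(5) documentation, and the names are the ones accepted by systemd's CoredumpFilter= setting. The function name `decode_coredump_filter` is invented for this example.

```python
# Index in this list == bit position in the coredump_filter mask.
COREDUMP_FILTER_BITS = [
    "private-anonymous",    # bit 0
    "shared-anonymous",     # bit 1
    "private-file-backed",  # bit 2
    "shared-file-backed",   # bit 3
    "elf-headers",          # bit 4
    "private-huge",         # bit 5
    "shared-huge",          # bit 6
    "private-dax",          # bit 7
    "shared-dax",           # bit 8
]


def decode_coredump_filter(text: str) -> list[str]:
    """Translate the hex string read from coredump_filter into flag names."""
    mask = int(text.strip(), 16)
    return [name for bit, name in enumerate(COREDUMP_FILTER_BITS) if mask & (1 << bit)]
```

Under these assumptions a value of 0001 decodes to private-anonymous only, and an all-zero mask decodes to no flags at all.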
Re: Can mkosi replace Kickstart / Calamares?
On Mo, 25.12.23 02:39, Patrick Schleizer (patrick-mailingli...@whonix.org) wrote: > Hi, > > I am maintaining a systemd, Debian-based Linux distribution (Kicksecure) and > am considering moving to mkosi as the "base image creation tool". > > It seems mkosi is a fine OS image builder. With systemd-repart, you even > solved the resizing of partitions at the first boot, which is magic. > > Suppose a Linux distribution is providing an OS image that can be written to > USB. Maybe soon, even to a CD/DVD. [1] > > Suppose that OS image is supposed to be able to act as an installer, so the > user can use it to install it on an internal hard drive. > > Is something like Kickstart or Calamares still required? It seems (at least > Calamares, whose code I am reading) is kind of "yet another OS image > builder". It doesn't build an image but instead writes to a hard drive. > However, I find it problematic that a lot of code (creating partition > tables, creating file systems, making bootable) is duplicated. [2] I don't really know what Kickstart/Calamares actually do. But it's certainly our intention to allow systemd-repart to operate like an installer, in the sense that you boot from a USB stick and you can use systemd-repart to copy the relevant partitions you just booted from to a target disk very efficiently, which will then be basically the same OS, just with maybe differently sized data/home partitions, new UUIDs, different crypto keys and such. More specifically, systemd-repart + bootctl install + systemd-firstboot is supposed to be enough to do what a classic installer disk can do on traditional OSes. Note that currently there are still some gaps, but people are working on this in various places. > Do you have any suggestions? > > Did you envision replacing installers, or do you already have tools for > that? Well, depends on what you mean by "installers". We certainly have no interest in replacing a package-based installer. 
But we certainly do want to provide you with basic tools which you can combine into an A/B image-based OS installer. > [2] But what about installer questions, customization such as time zone, > keyboard layout? I think the crucial question for an installer is the target > drive, and that's it. Perhaps partitioning and file system choices, but that > is more for geeks. How about time zone, keyboard layout? Valid points. But I > think those would be better handled through a first-boot GUI wizard. systemd-firstboot is supposed to be just that – but it only covers the offline and console cases. It's also supposed to be useful as a blueprint to implement something similar in a graphical tool. systemd-firstboot can be used in two modes: in "offline" mode, where you call it from the cmdline and specify --root= or --image= to let it operate directly on an OS tree you mounted somewhere or on a block device/image file you have accessible; or in "online" mode, where it is run at first boot and asks the user interactively. systemd-firstboot currently covers hostname, locale, keyboard, timezone and root pw. In systemd git main you also find a "homectl firstboot" command which can prompt the user interactively for a user to create at boot. Regarding partitioning: my thinking was that installers would ship multiple alternative sets of repart .conf files, of which the first that can be applied is applied or of which the user can pick one explicitly, depending on the use case. The focus is clearly on automatic partitioning here though; if people want to manually and precisely set the sizes of each partition in a UI, then repart is not the tool they should use. Lennart -- Lennart Poettering, Berlin
Re: systemd-sysupdate support for slow rollout (aka A/B testing)
imited to no interest over precise > control of updates and user devices and the users wish for anonymity. > On the other hand though are enterprises which deploy sysupdate for > (I)IoT devices. In these case devices commonly have to be registered > anyhow, and the enterprise controls how updates are rolled out etc. In > these cases anonymity is not necessary and instead customers often pay > the enterprise to perform all the management on their behalf. I think adding some concept for this would be entirely fine, but this really should be opt-in. Happy to review a patch for this. I think in the longer run we need to hook this up with remote attestation though, i.e. instead of just including the machine ID, include a quote from the TPM about PCR 15 (which includes a measurement of the machine ID), signed by some suitable local TPM key. That would make it a bit harder for clients that were hacked to play games with you, and report incorrect machine IDs. Lennart -- Lennart Poettering, Berlin
Re: systemd-sysupdate support for slow rollout (aka A/B testing)
On Di, 02.01.24 13:11, Simon McVittie (s...@collabora.com) wrote: > Prior art: Debian/Ubuntu apt does slow rollout for packages like > this, with simple filesystem-based http mirrors combined with "smart" > clients. It works by adding a Phased-Update-Percentage field to the > metadata of each package. The client calculates some sort of ID for itself > (I don't know precisely how), and then takes the upgrade if it finds that > its ID is in the first x% of the available range. > > If I understand correctly, Ubuntu is using this mechanism in production > but Debian is not. > > Using some sort of hash of the machine ID + the proposed version would > probably have the behaviour you want, of choosing a different x% of > machines to be the early-adopter set for each update? Yes, this is what I think would be the right approach. > > This would then mean for the server that it would first serve > > foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of > > the hosts would update to it. And then, once you collected enough > > feedback you'd rename the file to foobar_47.11_25.raw and 25% of the > > hosts would switch over. Finally you'd set the value to 100 (or maybe > > just drop it, which should be considered equivalent to 100), and then > > all remaining hosts would update. > > If you're using a hash of the machine ID + the proposed version as > your randomization, then I think you'd want to have a single image (or > version ID, or some other unique identifier) for each proposed update, and > separately, a metadata field that sets *x* in the instruction "if you have > figured out that you are in the first x% of machines, upgrade". Otherwise, > publishing foobar_47.11_3.raw followed by foobar_47.11_25.raw would be > more likely to result in approximately (3% + 25% = 28%) of machines > upgrading[1], because the client doesn't know that it's actually the > same update and would "re-roll the dice" for each republished name. 
My thinking was that clients would look at multiple entries which only differ by the percentage (i.e. are identical in name and version) and drop all of them but the one with the highest percentage, and ignore all others. Lennart -- Lennart Poettering, Berlin
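The "keep only the highest percentage" rule sketched above could look roughly like this on the client side. Everything here is an assumption made for illustration: the name_version_percentage.raw layout is taken from the foobar_47.11_3.raw example in this thread, a missing percentage is treated as 100, names and versions are assumed not to contain "_", and none of this is existing sysupdate behaviour.

```python
def parse_entry(filename: str) -> tuple[str, str, int]:
    """Split 'name_version[_percentage].raw' into its parts (hypothetical layout)."""
    if not filename.endswith(".raw"):
        raise ValueError(f"unexpected file name: {filename}")
    parts = filename[: -len(".raw")].split("_")
    if len(parts) == 2:
        name, version = parts
        return name, version, 100  # no percentage given: treat as 100
    if len(parts) == 3:
        name, version, percentage = parts
        return name, version, int(percentage)
    raise ValueError(f"unexpected file name: {filename}")


def collapse_entries(filenames: list[str]) -> dict[tuple[str, str], int]:
    """For entries identical in name and version, keep only the highest percentage."""
    best: dict[tuple[str, str], int] = {}
    for filename in filenames:
        name, version, percentage = parse_entry(filename)
        key = (name, version)
        best[key] = max(percentage, best.get(key, 0))
    return best
```

With this rule, a server that briefly lists both foobar_47.11_3.raw and foobar_47.11_25.raw presents one candidate at 25%, so the client does not evaluate the same release twice.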
Re: sysupdate: Limit update to at most one major version
On Di, 02.01.24 13:49, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > I'd be fine with adding MaxVersion=. Happy to review a patch, merge > > something like this (at least file an RFE issue) > > Should that be inclusive or exclusive? Naming it MaxVersion would > imply it to be inclusive though an exclusive bound would likely be > more useful most of the time. One could then specify MaxVersion=1.3.0 > in their 1.2.x images and once they have an upgrade path they would > explicitly raise the max version in e.g. 1.2.15. Otherwise they would > have to specify 1.99.99. > In retrospect a VersionBound= property with syntax similar to > ConditionKernelVersion= would have been better though I guess that > ship has sailed - or is it? Is sd-sysupdate still considered > experimental? Not sure if this warrants such a change though :shrug: We do not allow "=" nor "<" in version strings, as per https://uapi-group.org/specifications/specs/version_format_specification/. Hence we could use that fact and say: "MaxVersion= <=47.11", "MaxVersion= <47.11" could be used to make the type of version comparison explicit. This would implement a tiny subset of the ConditionKernelVersion= logic, and simply default to imply <= if the comparison is not specified explicitly. Of course, a similar logic should then be implemented for MinVersion, i.e. >= and > > Should we continue this discussion on the mailing list or an issue? Issue is better. Lennart -- Lennart Poettering, Berlin
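The proposed syntax (an optional comparison operator in front of the version, defaulting to "<=" for MaxVersion=) can be illustrated with a small parser. This is a sketch only: MaxVersion= does not exist yet, the function names are invented, and the version comparison is deliberately simplified to dotted integers rather than the full algorithm from the version format specification.

```python
import operator

OPERATORS = {
    "<=": operator.le,
    ">=": operator.ge,
    "<": operator.lt,
    ">": operator.gt,
}


def parse_bound(value: str, default_op: str = "<=") -> tuple[str, str]:
    """Split '<=47.11' into ('<=', '47.11'); fall back to default_op if no operator is given."""
    value = value.strip()
    # Two-character operators are checked first so '<=' is not read as '<'.
    for op in ("<=", ">=", "<", ">"):
        if value.startswith(op):
            return op, value[len(op):].strip()
    return default_op, value


def as_tuple(version: str) -> tuple[int, ...]:
    """Simplified stand-in for real version comparison: dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def satisfies(candidate: str, bound: str, default_op: str = "<=") -> bool:
    """Check a candidate version against a bound such as '<1.3.0' or '47.11'."""
    op, limit = parse_bound(bound, default_op)
    return OPERATORS[op](as_tuple(candidate), as_tuple(limit))
```

This relies on the point made above that "=" and "<" cannot appear in a version string, so a leading operator is never ambiguous. A hypothetical MinVersion= counterpart would call the same helper with default_op=">=".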
Re: sysupdate: Limit update to at most one major version
On So, 31.12.23 14:43, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hello, > > we are currently using sd-sysupdate to roll out updates and we're wondering > if there is any possibility to limit updates to consider at most one next > major version. This would allow us to write the software to handle only > data migrations from the previous major version instead of any version > beforehand. > The only thing I have been able to find is MinVersion= which seems to do > exactly the opposite of what we would want to do. I'd be fine with adding MaxVersion=. Happy to review a patch, merge something like this (at least file an RFE issue) Lennart -- Lennart Poettering, Berlin
Re: systemd-sysupdate support for slow rollout (aka A/B testing)
On Mi, 20.12.23 19:04, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hey everyone, > > does sysupdate currently support any way to slowly roll out updates > where the server providing the files can be in control? This would be > used to slowly make a new version available and have it at e.g. 1% > adoption for a day to monitor regressions before increasing the > coverage. I was unable to find any information about it in the > documentation. This is currently not available, no. The idea so far was always that the server is dumb, and the client picks the release it wants. I have thought about this usecase a while back, and my thinking was that such a staged update logic should be driven by the machine ID. i.e. we should teach sysupdate a simple logic that allows pattern matching of new versions based on some arithmetic of the machine ID. More specifically, include some value in the URL pattern that indicates the percentage of hosts that shall update to this release. Then, each client takes its machine ID, treats it as an integer and calculates modulo 100 of it or so, and then checks if the resulting value is below the intended percentage, and if so it updates, otherwise it doesn't. (or something like that, the above is probably not ideal, since it would mean it's always the same hosts that try a new release first, and it probably should be evened out across the set of clients). This would then mean for the server that it would first serve foobar_47.11_3.raw which would be version 47.11 of the OS, and 3% of the hosts would update to it. And then, once you collected enough feedback you'd rename the file to foobar_47.11_25.raw and 25% of the hosts would switch over. Finally you'd set the value to 100 (or maybe just drop it, which should be considered equivalent to 100), and then all remaining hosts would update. 
The effect of this is that clients could still explicitly upgrade if they want, and the updates would be entirely driven by the clients, but simply via naming the download images the server can control that "by default" only the chosen number of clients update. > Currently it seems like I would have to implement a different service > which calls the sysupdate binary (or uses dbus once #28134 has landed) > and then decides based on some other information. > > One idea I had would be that systemd-pull could send the machine-id > based on which the server could then decide to provide the newer file > (e.g. last two chars == "00" would roll it out to ~1/255). Though I am > not sure if sd-pull is supposed to be "anonymous", i.e. do not provide > this identifying information. Another drawback of this would be that > stateless systems which reboot often get a new machine-id each boot, > thus having an increased chance to get the newer version. So this idea is not entirely different from my idea, I was just thinking about pushing this into sysupdate rather than pull. > Does anything like this already exist or is planned? Or should that be > done by different applications on the client side? I think it makes a ton of sense to add this to sysupdate. Would love to review/merge a patch for that. > I also remember there being a discussion about plugging in different > sd-pull like implementations/backends[1] to support delta updates, > other transports, or TLS client authentication. This could at least be > adapted to support my idea to send the machine-id as an HTTP header > (e.g. X-MACHINE-ID). If we can avoid it, I'd always adopt a logic where identifying info doesn't have to be sent to the server. After all the logic should be generic and applicable in scenarios where the client should get anonymity as much as it wants. The machine-id we usually consider a "half-secret", i.e. 
all local programs get access to it (unless sandboxed), but they are not supposed to send it across the wire. If they really need to send some identifier across the wire they should derive an app-specific ID instead, which we make easy to acquire via sd_id128_get_machine_app_specific(). But better than app-specific machine IDs are no machine IDs at all in the protocol, if we can get away with it. Hence, my idea of doing the rollout percentage logic client-side. Lennart -- Lennart Poettering, Berlin
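A client-side rollout check along these lines might look as follows. This is a sketch under stated assumptions, not an existing sysupdate feature: it hashes the ID together with the proposed version (so a different set of hosts goes first for each release, which addresses the "always the same hosts" concern raised above), SHA-256 is an arbitrary choice, and the function names are hypothetical. Nothing leaves the machine.

```python
import hashlib


def rollout_bucket(machine_id: str, version: str) -> int:
    """Map (machine ID, version) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{machine_id}:{version}".encode("utf-8")).digest()
    # 64 bits of the digest are plenty; the modulo bias is negligible here.
    return int.from_bytes(digest[:8], "big") % 100


def should_update(machine_id: str, version: str, percentage: int) -> bool:
    """True if this host falls within the first `percentage` percent for this version."""
    return rollout_bucket(machine_id, version) < percentage
```

Because the bucket is stable for a given release, raising the published percentage from 3 to 25 only adds hosts; the ones that already updated stay inside the selected set. The check itself never transmits the ID.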
Re: Query on dynamic update of Kernel commandline
On Mi, 13.12.23 10:28, HARSHAL PATIL (QUIC) (quic_hgpa...@quicinc.com) wrote: > Hello Fellow Community, > > I have a following question. Your help will be really appreciated. > > "Kernel expects few params from kernel cmdline from underlying > firmware. Is there any mechanism to dynamically update the cmdline > updated by firmware (UEFI) during boot time in systemd-boot similar > to DT fixup protocol ?" I don't understand the question. Why would stuff from the UEFI firmware be added to the kernel cmdline? And what does that have to do with the DT fixup protocol? Various UEFI interfaces are available from userspace anyway; are you sure that whatever data you are looking for isn't readily available from /sys/ already? Why must it be on the kernel command line? > Example : androidboot.serialno is read from firmware and needs to be appended > to kernel commandline I don't know what this is, or what it has to do with UEFI, sd-boot or DT. Anyway, the question is very confusing; I am not surprised no one has answered so far. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Do, 14.12.23 02:17, Nils Kattenbeck (nilskem...@gmail.com) wrote: > On Wed, Dec 13, 2023 at 10:03 AM Lennart Poettering > wrote: > > > > On Di, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > > > > > sysexts are erofs or squashfs file systems with verity backing. Only > > > > the sectors you access are decompressed. > > > > > > Okay I forgot that they were erofs based and mentioned cpio archives > > > so I assumed they would be one. > > > Do they need to be fully read from disk to generate the cpio archive? > > > > erofs is a file system, cpio is a serialized archive. Two different > > things. The discussion here is whether to pass the initrd to the > > kernel as one or the other. But noone is suggesting to convert one to > > the other at boot time. > > I was referring to the following line from sd-stub's man page: "The > following resources are passed as initrd cpio archives to the booted > kernel: [...] /.extra/sysext/*.raw [...]". I assume the initrd > containing the sysexts has to be created at some point? These cpios are created on-the-fly, placed into memory and passed to the invoked kernel. And yes, for that the data they contain needs to be read off disk first. Lennart -- Lennart Poettering, Berlin
Re: Ton of random units "could not be found"
On Fr, 15.12.23 22:17, chandler (s...@riseup.net) wrote: > Hi all, > > When I run `systemctl status --all` I see a ton of "Unit X could not > be found" where X = an item from the list below. How did this mess > happen and how to clean it up? None of these units represent things the > system is using, for the most part. This is not an issue. As Andrei already answered, this just tells you that some services have ordering deps against other units which aren't installed, which is entirely fine. It's just metainfo that ensures the right order is applied if you install certain packages in combination. There's a reason why these entries are generally not shown. Except you used "--all", which literally means "Hey, please also show me *everything* you have heard about". Just drop the "--all" from your command line. > Some units appear to be remnants left behind in /etc/systemd, for > example /etc/systemd/system/ntp.service is a symlink pointing to > non-existent /lib/systemd/system/ntpsec.service. I can delete > /etc/systemd/system/ntp.service and after `systemctl daemon-reload` it's > now gone from the list below. That smells like a packaging bug: you removed some package and it forgot to invoke "systemctl disable" from its packaging uninstall scripts first. File a bug against your distro. > Other items have different situations, like tmp.mount exists at > /usr/share/systemd/tmp.mount but isn't an enabled unit or anything, if I > try to enable or unmask it I'm just told "Unit tmp.mount could not be > found." or "Unit file tmp.mount does not exist." /usr/share/systemd/ is not a directory systemd ever looks into for unit files. If Debian packaged something there, this smells like a bug. Please report to your distro. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Di, 12.12.23 23:01, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > sysexts are erofs or squashfs file systems with verity backing. Only > > the sectors you access are decompressed. > > Okay I forgot that they were erofs based and mentioned cpio archives > so I assumed they would be one. > Do they need to be fully read from disk to generate the cpio archive? erofs is a file system, cpio is a serialized archive. Two different things. The discussion here is whether to pass the initrd to the kernel as one or the other. But no one is suggesting to convert one to the other at boot time. Lennart -- Lennart Poettering, Berlin
Re: networkd RetransmitSec - how to make it work on a host?
On Mo, 11.12.23 02:49, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote: > The RetransmitSec option was introduced in systemd-v255, but I > cannot get it to work for Neighbor Solicitations from a > Host. Instead, I observe that the NS are always transmitted at 1 > second intervals, regardless of whether it was changed by: Please file this as a GitHub issue. It sounds like a bug report, which should really go to GitHub. Lennart -- Lennart Poettering, Berlin
Re: systemd units disabled when calling systemctl daemon-reload
On Di, 12.12.23 19:06, Etienne Cordonnier (ecordonn...@snap.com) wrote: > Hello, > I am debugging some embedded system running systemd. The behavior I am > observing is that many systemd targets such as multi-user.target are > disabled after I run systemctl daemon-reload (as shown by systemctl > list-units --type target --all). This causes many systemd units to be > disabled, and forces me to reboot the system. What do you mean by "disabled"? In systemd, targets can be active or inactive, and that's what "systemctl list-units" shows. They can also be enabled/disabled, but that's what "systemctl list-unit-files" shows. But targets such as multi-user.target can be neither enabled nor disabled; they are considered "static", i.e. always enabled if you so will, which "systemctl list-unit-files" should actually show. Hence, I don't really grok what you are trying to say here... > Is there a way to debug this systemd target transition? I already > enabled systemctl > log-level debug, but I still don't understand why the systemd target is > changing when I call systemctl daemon-reload on this particular system. Please state OS, systemd version and provide relevant logs. Otherwise this is not actionable. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Di, 12.12.23 21:34, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hi, while I have been following this thread passively for now I also > wanted to chime in. > > > (The main reason why sd-stub doesn't actually support erofs-initrds, > > is that sd-stub also generates initrd cpios on the fly, to pass > > credentials and system extension images to the kernel, and you can't > > really mix erofs and cpio initrds into one) > > What prevents one from mixing the two (especially given that the > hypothetical erofs initrd support does not yet exist)? > Or are you talking about mixing this with your memmap+root=/dev/pmem > suggestion? If you have 7 cpio initrds then the kernel will allocate a tmpfs and unpack them all into it, one after the other, on top of each other, and then jump into the result. If you have an erofs and 7 cpio initrds, what are you going to do? You cannot extract into an erofs, it's immutable. You'd need something like overlayfs, but that would require (at least for now) an additional step in userspace, which is something to avoid. Alternatively (and preferred by me), the kernel would support a mode where it would unpack any cpios it gets into a tmpfs, and then pass an fsopen() fd to that to the executable it then invokes from the erofs. The executable could then mount that somewhere if it wants. But this would require a kernel patch. > Even if everything is the same there are codes paths which might not > be taken during usual operation. An example would be services similar > to the new systemd-bsod which are only triggered in emergencies. > Having these in the cpio means that they will always be read and > decompressed. systemd-bsod is tiny though, less than 8K compressed here. Not sure it is a good example. > Using sysexts also has the drawback that each and every one of them > has to be decompressed. 
I might be mistaken but I expect that this > will be the case even if the extension-release in the sysext results > in it being discarded which is obviously another big drawback. sysexts are erofs or squashfs file systems with verity backing. Only the sectors you access are decompressed. Lennart -- Lennart Poettering, Berlin
Re: IPv6 Compliance for networkd
On Mo, 11.12.23 19:14, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote: > Hello, networkd developer community, > > I am hoping to rally support for making networkd IPv6 compliant and > I'm willing to help, but cannot do it alone. Is there any interest in > making systemd-networkd IPv6 compliant? Well, interest is relative. I think for most people IPv6 already works well enough, they don't really care about compliance programs on this so much. But as long as the requirements are reasonable they also wouldn't mind (and prefer) if networkd passes those qualifications. > There are many organizations (especially US Government) that mandate > IPv6 compliance (USGv6). Products that are dependent on networkd > cannot be bid to these customers. For the people currently involved with networkd upstream this is not a top priority. If this is important to you however, that's great, we are happy to review/merge patches. > How do I engage with the right people in the developer community? Send PRs via github. > Thanks, > Matt. > > PS: Mailing list topics go unanswered and github issues get lost in > the noise, so I'm hoping there's a more efficient way to > collaborate. It's an Open Source project: if something matters a lot to you, then please file PRs to get the work merged. We generally try to review PRs sooner or later, but we are swamped with work, so it might take a while. Just filing issues (while also appreciated) will usually not magically make somebody work on this for you though. It's kinda the same with most open source projects btw. If this is something you'd like to see addressed soon, I'd recommend maybe paying some consultancy (we have worked with codethink on some projects, they should be willing to work on this, are capable and know how to get stuff in systemd done). If you don't have the cash for that, it might work to get funding for this from organizations such as the German STF and things like that. I am pretty sure that the US has something similar? 
Anyway, judging by your email address I understand you work for HPE, so I'd assume your company actually has the funds to payroll this though, if this matters to you. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 17:03, Eric Curtin (ecur...@redhat.com) wrote: > A generic approach is hard, I think it's worth discussing which type of boots > you should actually care about milliseconds of performance for. It would be > nice > if we had an init system that contained the binary data to do the minimum for > standard Fedora, Debian installs and everything else was an extension whether > that's sysexts, dlopen, a new binary to execute etc. > > If the network is ingrained in your boot stack like this, I'm > guessing you probably don't care about boot performance. Uh, I am not sure that's really true. People boot up VMs on demand, based on network traffic. They sure care about latency and boot times. I mean people care about firecracker and these things precisely because it brings the off-to-IP time to a minimum. > Automotive has an expectation for really fast boots, like 2 seconds, in > standard > desktops installs there's some expectation as you interface directly > with a human, > but for other installs how much expectation is there? AFAIR in particular in cars there's quite some functionality you probably want to move very early in boot. Which yells to me that you want a service manager super early. Which again suggests to me that the first initrd that runs should probably already cover that. If I were you I'd probably focus on a design like this: ship a basic systemd in an initrd. Complete enough to find the harddisk, and to run the other services that are absolutely necessary this early. Then, once you found the disk, look for sysext images on it, and apply them all on top of the initrd's root fs you are already running with. Never transition anywhere else. Then try to optimize the initrd a bit by making it an erofs/memmap thing and so on. 
And make sure the initrd only contains stuff you always need, so that reading it all into memory is necessary anyway, and hence any approach that tries to run even the initrd off a disk image won't be necessary because you need to read everything anyway. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 11:28, Demi Marie Obenour (d...@invisiblethingslab.com) wrote: > I don't think this is "a pretty specific solution to one set of devices" > _at all_. To the contrary, it is _exactly_ what I want to see desktop > systems moving to in the future. > > It solves the problem of large firmware images. It solves the problem > of device-specific configuration, because one can use a file on the EFI > system partition that is read by userspace and either treated as > untrusted or TPM-signed. It means that one can have a complete set of > recovery tools in the event of a problem, rather than being limited to > whatever one can squeeze into an initramfs. One can even include a full > GUI stack (with accessibility support!), rather than just Plymouth. For > Qubes OS, one can include enough of the Xen and Qubes toolstack to even > launch virtual machines, allowing the use of USB devices and networking > for recovery purposes. It even means that one can use a FIDO2 token to > unlock the hard drive without a USB stack on the host. And because the > initramfs _only_ needs to load the boot extension volume, it can be > very, _very_ small, which works great with using Linux as a coreboot > payload. systemd's "system extension" concept ("sysexts") already allows you to do all that. The stuff I was fantasizing about would only change one thing: instead of sd-stub from uefi mode already putting the sysexts you installed into memory for the initrd to consume, it would be some proto-initrd that would do so. This does not really change what you can do with this, but mostly is just an optimization, reducing iops and memory use a bit, and thus boot time latency. > The only problem I can see that this does not solve is network boot, but > that is very much a niche use case when compared to the millions of > Fedora or Debian desktop installs, or even the tens of thousands of > Qubes OS installs. 
Furthermore, I would _much_ rather network boot be > handled by userspace and kexec, rather than the closed source UEFI network > stack. Well, somebody's niche is somebody else's common case. In VM/cloud/server scenarios network booting is not that "niche" as it might be on the desktop. > It does require some care when upgrading, as the dm-verity image and the > UKI cannot both be updated atomically, but one can solve that by first > writing the new dm-verity image to a separate location. The UKI will > try both the old and new locations for the dm-verity image and > rename the new image over the old one on success. The wrong image will > simply fail to mount as its root hash will be wrong. systemd-sysext already covers this just fine: you can encode in their "extension-release" file to which base images they match up, and systemd-sysext will then find the right one to apply, and ignore the others. Thus just make sure you drop in the sysexts first, and the UKI last and things should be perfectly robust. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote: > Although the nice thing about a storage-init like approach is there's > basically zero copies up front. What storage-init is trying to be, is > a tool to just call systemd storage things, without also inheriting > all the systemd stack. Just to make this clear: using things like systemd-cryptsetup outside of the systemd stack is not going to work once you leave trivial setups. i.e. the TPM hookup involves multiple services these days, and it's not going to get any simpler. i.e. systemd-tpm2-setup, systemd-pcrextend, systemd-pcrlock and so on. I am sorry, but doing reasonable disk encryption with TPM involved means you either buy into the whole systemd offer (i.e. with the service manager) or you have to rewrite your own systemd. But maybe I am misunderstanding what you are saying here. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 12:48, Eric Curtin (ecur...@redhat.com) wrote: > Sort of yes, but preferably using that __initramfs_start / > initrd_start buffer as is without copying any bytes anywhere else and > without teaching the bootloaders to do things. > > The "memmap=" approach you suggested sounds like what we are thinking, > but do you think we could do this without teaching bootloaders to do > new things? Well, in a standard UEFI world it would suffice to teach the memmap= logic to the stub that is glued in front of the kernel. For example, make sd-stub find the erofs initrd in the UKI, then trivially synthesize a memmap= switch and append it to the kernel command line. but of course, you don't believe in UEFI or good boot loaders, so you kinda dug your own grave here... (The main reason why sd-stub doesn't actually support erofs-initrds, is that sd-stub also generates initrd cpios on the fly, to pass credentials and system extension images to the kernel, and you can't really mix erofs and cpio initrds into one) Lennart -- Lennart Poettering, Berlin
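For illustration, the "synthesize a memmap= switch and append it to the kernel command line" step could look roughly like the Python sketch below. This is not sd-stub code (sd-stub is written in C and, as said above, does not do this); the function name and the example address and size are made up. It only assumes the kernel's memmap=<size>!<start> syntax for marking a memory range as persistent memory.

```python
def synthesize_memmap(cmdline: str, start: int, size: int) -> str:
    """Append a memmap=<size>!<start> switch to a kernel command line.

    start and size are byte values; the kernel also accepts hex here.
    """
    if start < 0 or size <= 0:
        raise ValueError("start must be >= 0 and size must be > 0")
    switch = f"memmap={size:#x}!{start:#x}"
    return f"{cmdline} {switch}".strip()


if __name__ == "__main__":
    # Hypothetical values: a 64 MiB erofs image loaded at the 4 GiB mark.
    print(synthesize_memmap("root=/dev/pmem0 ro", start=4 << 30, size=64 << 20))
```

A real stub would additionally have to make sure the region is reserved and suitably aligned, which this sketch does not attempt.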
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 11:42, Eric Curtin (ecur...@redhat.com) wrote: > I am also thinking, what is the difference between "make the > bootloader load the erofs into contiguous memory" part and doing > something like storage-init. Well, from my PoV there's value in reducing the stages of the boot process, and reducing the amount of storage stacks you need in the mix. Hence, the boot loader can load stuff from disk into memory anyway, it always has done that, typically the kernel and the initrd. Just swapping out the format of the initrd to get better behaviour is relatively cheap there, means no additional storage logic, no additional stage of the boot. You basically only have "boot loader" (which loads kernel and initrd), and the "host os" (which runs off the final rootfs). Otoh if you let your storage-init load the initrd, then you basically have a third step in the middle, which shares a lot of props with the last step, but also is distinct. I mean, you probably would reinvent your own udev and DM stack for that, to get verity in the mix (because that depends on DM, and udev, to some degree) In my ideal model, initrds are just part of the UKI btw, so they end up being loaded together with the rest of the kernel, and need no verity because they are signed along with the UKI itself. Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 11:28, Eric Curtin (ecur...@redhat.com) wrote: > > > For the items listed above I think you can find different solutions > > > which do not necessarily compromise security as much. > > > > > > So, in the list above you could address the latter three like this: > > > > > > 2. Use an erofs rather than a packed cpio as initrd. Make the boot > > >loader load the erofs into contiguous memory, then use memmap=X!Y on > > >the kernel cmdline to synthesize a block device from that, which > > >you then mount directly (without any initrd) via > > >root=/dev/pmem0. This means your boot loader will still load the > > >whole image into memory, but only decompress the bits actually > > >needed. (It also has some other nice benefits I like, such as an > > >immutable rootfs, which tmpfs-based initrds don't have.) > > What I am unsure about here, is the "make the bootloader load the > erofs into contiguous memory" part. I wonder could we try and use the > existing initramfs data as is. Today's initrds are packed cpio archives of an OS file system hierarchy. What I proposed means you'd have to put the OS file system hierarchy into an erofs image instead. Which is a trivial operation, just unpack and repack. Note that there are two concepts of "initrd" out there. a) from the kernel perspective an initrd/initramfs (which both are badly named, because it's a tmpfs these days) is that packed cpio archive that is unpacked into a tmpfs, and then jumped into. b) from systemd's perspective an initrd is an OS image that carries an /etc/initrd-release file. If that file exists then systemd will not boot up the system regularly, but instead just prepare everything so that it can transition into some other root fs. Most often in real life initrds currently qualify under both definitions, but there's no reason to always do this. You can also have images the kernel would consider an initrd, but systemd does not, which is something we use in the "USI" concept, i.e. 
"unified system images", which are basically UKIs (large UKIs) with a complete rootfs that is the main system of the OS. And you can also do it the other way round, which is potentially what I am suggesting to you here: use an erofs image that would not be considered an initrd by the kernel, but that systemd would consider one, and transition out of. > I dunno if > bootloaders make much assumptions about the format of that data, worst > case scenario we could encapsulate erofs in the initramfs, cpio looking > data. boot loaders generally don't bother with the cpio, it's just "data" for them. Compression algorithms have changed in the past, and it only mattered that the kernel could decompress it, the boot loader doesn't care. > Teach the kernel not to decompress and process the whole > thing and mount it like an erofs alternatively. Does this sound crazy > or reasonable? You are re-inventing the traditional "initrd" logic of the kernel which was a ramdisk (i.e. a block device /dev/ram0), that was filled with some fs of your choice loaded by the boot loader. Lennart -- Lennart Poettering, Berlin
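Definition b) above boils down to a single file check. A minimal Python sketch of it, with a made-up function name and a root parameter added only so the check can be tried against a scratch directory (systemd itself does this in C, and may consider more than this one file):

```python
import os
import tempfile


def looks_like_systemd_initrd(root: str = "/") -> bool:
    """Return True if the OS tree at root carries /etc/initrd-release."""
    return os.path.exists(os.path.join(root, "etc", "initrd-release"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tree:
        print(looks_like_systemd_initrd(tree))  # no marker file yet
        os.makedirs(os.path.join(tree, "etc"))
        open(os.path.join(tree, "etc", "initrd-release"), "w").close()
        print(looks_like_systemd_initrd(tree))  # marker file present
```

Note that nothing here depends on how the image reached memory, which is the point made above: an erofs image mounted via root=/dev/pmem0 can carry the file just as well as an unpacked cpio archive.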
Re: [RFC] initoverlayfs - a scalable initial filesystem
On Mo, 11.12.23 10:57, Lennart Poettering (mzerq...@0pointer.de) wrote: > Which leaves item 1, which is a bit harder to address. We have been > discussing this off and on internally too. A generic solution to this > is hard. My current thinking for this could be something like this, > covering the UEFI world: support sticking a DDI for the main initrd in > the ESP. The ESP is per definition unencrypted and unauthenticated, > but otherwise relatively well defined, i.e. known to be vfat and > discoverable via UUID on a GPT disk. So: build a minimal > single-process initrd into the kernel (i.e. UKI) that has exactly the > storage to find a DDI on the ESP, and set it up. i.e. vfat+erofs fs > drivers, and dm-verity. Then have a PID 1 that does exactly enough to > jump into the rootfs stored in the ESP. That latter then has proper > file system drivers, storage drivers, crypto stack, and can unlock the > real root. This would still be a pretty specific solution to one set > of devices though, as it could not cover network boots (i.e. where > there is just no ESP to boot from), but I think this could be kept > relatively close, as the logic in that case could just fall back into > loading the DDI that normally would still in the ESP fully into > memory. BTW, one thing I would like to emphasize though. I think this item is really the last thing you should focus on. If your OS never transitions out of the initrd, and gets its payload merged in via DDIs, then the root fs should be reasonably small enough and "fully used at boot" (i.e. every sector read anyway) that doing this extra work of finding a split-out DDI on the ESP is entirely unnecessary and just a waste of time (both of developer time and boot time). Lennart -- Lennart Poettering, Berlin
Re: [RFC] initoverlayfs - a scalable initial filesystem
a PID 1 that does exactly enough to jump into the rootfs stored in the ESP. That latter then has proper file system drivers, storage drivers, crypto stack, and can unlock the real root. This would still be a pretty specific solution to one set of devices though, as it could not cover network boots (i.e. where there is just no ESP to boot from), but I think this could be kept relatively close, as the logic in that case could just fall back into loading the DDI that normally would still in the ESP fully into memory. (If you are focussing on systems lacking UEFI, then replace the word "ESP" in the above with a similar concept, i.e. a well discoverable, unauthenticated relatively simple file system, such as vfat). Anyway, I can't tell you how to solve your specific problems, but if there's one thing I'd suggest you to keep in mind then it's the security angle, i.e. keep in mind from the beginning how authentication of every component of your process shall work, how unattended disk encryption shall operate and how measurement shall work. Security must be built into things from the beginning, not be added as an afterthought. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Manual start of user@.service failed with permission denied
On Fr, 08.12.23 08:52, Christopher Wong (christopher.w...@axis.com) wrote: > Hi Lennart, > > I know we are not using the pam_systemd. That is the reason we try > to run the steps manually. It was possible to start the > user@.service in systemd v253, but it fails now with v254 or > later. Well, that's not supported then. You need XDG_RUNTIME_DIR set up properly, and that's what the PAM module gives you. If you turn off the PAM module then you get to keep the pieces, you voided your warranty. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Manual start of user@.service failed with permission denied
On Do, 07.12.23 18:29, Christopher Wong (christopher.w...@axis.com) wrote: > Hi Lennart, > > We are doing the steps to start up a rootless docker. If I don’t set > XDG_RUNTIME_DIR then I will get the below error: > > systemd[1925]: Trying to run as user instance, but $XDG_RUNTIME_DIR > is not set. pam_systemd is responsible for setting this env var. Most likely you are missing that from the PAM stack that is used by this user@.service instance? > The 503 is a system user. So, just to try it out, I created a user, > which got the UID 1001. Using that UID gave me the same result as > the 503. It's a bad idea to run user stuff as system user. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Manual start of user@.service failed with permission denied
On Mi, 06.12.23 14:46, Christopher Wong (christopher.w...@axis.com) wrote: > Hi, > > I’m trying to do the following: > > root@host:~# systemctl set-environment > XDG_RUNTIME_DIR="/run/user/503" Why would you do that? user@.service automatically pulls in user-runtime-dir@.service which is responsible for creating that dir with the right perms. Is 503 a system user? Or a regular user? systemd generally assumes the boundary between system and regular users is between 999 and 1000. But user@.service is really just for regular users, not system users, hence my question. Lennart -- Lennart Poettering, Berlin
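The 999/1000 boundary mentioned above can be written down as a tiny check. This is only a sketch of the rule of thumb, not systemd code: the function name is made up, and the constant is the commonly used default, which distributions can build with a different value.

```python
SYSTEM_UID_MAX = 999  # assumption: the common default boundary


def is_system_user(uid: int, system_uid_max: int = SYSTEM_UID_MAX) -> bool:
    """Classify a UID as system user (True) or regular user (False)."""
    if uid < 0:
        raise ValueError("UIDs are non-negative")
    return uid <= system_uid_max


if __name__ == "__main__":
    for uid in (503, 1001):
        kind = "system" if is_system_user(uid) else "regular"
        print(f"UID {uid}: {kind} user")
```

By this rule UID 503 from the report falls on the system side, which is why user@.service is not meant for it.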
Re: [systemd-devel] How to debug systemd-pcrphase-initrd.service failure
On Mi, 06.12.23 18:28, Renjaya Raga Zenta (ragaze...@gmail.com) wrote: > Hi, > > I am exploring OS image building with mkosi. It works great until I add TPM > 2.0 in qemu. > > I found that the systemd-pcrphase-initrd.service failed. There are 3 > pcrphase service: > > 1. systemd-pcrphase-initrd.service (failed) > 2. systemd-pcrphase.service (ok) > 3. systemd-pcrphase-sysinit.service (ok) So the latter two run from the host fs, the first one from the initrd fs. > Related journal log: > systemd[1]: Failed to start systemd-pcrphase-initrd.service - TPM2 PCR > Barrier (initrd). > ... > systemd-pcrphase[130]: Failed to load TPM2 libraries: Operation not > supported > ... It appears you are lacking the tpm2-tss libraries in your initrd image. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] how to use systemd-sysext addons and systemd-stub to extend an UKI initrd
On Mo, 04.12.23 17:40, Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote: > Hello everyone, > > As the title suggests, I am trying to extend an UKI initrd via > systemd-sysext addons/extensions. > > I contributed to the systemd-stub UKI addons to extend the kernel > command line, so I know how they works and planning to give a talk about > them soon. However, I would like to get the full picture by using the > same mechanism but with systemd-sysext addons to extend also initrd. > > As I understood, a systemd-sysext addon in > /boot/efi/EFI/Linux/.efi.extra.d will be put in /.extra/sysext > by systemd-stub, and then will be picked up by systemd-sysext to be > added into the initrd. > > I am using Fedora, I created my UKI devel.efi, and made sure (just for > safety) that the initrd contains the systemd-sysext module, as I > generated it with dracut. > > The UKI is created with freshly compiled systemd-stub from commit > 5808300c44. Kernel is 6.6.0-0.rc1.20230915git9fdfb15a3dbf.17.fc40.x86_64 > > Then, I created a super dumb extension and put it in the right location: > mkdir extension > cd extension/ > vi ciao.txt > mkdir usr > cp ciao.txt usr/ciao2.txt > cat /etc/os-release > mkdir -p usr/lib/extension-release.d/ > echo ID=fedora > usr/lib/extension-release.d/extension-release.extension > echo VERSION_ID=40 >> > usr/lib/extension-release.d/extension-release.extension > cat usr/lib/extension-release.d/extension-release.extension > cd .. > mksquashfs extension extension.raw > mv extension.raw /boot/efi/EFI/Linux/devel.efi.extra.d/ The image must come with verity + signature, we'll not allow unsigned extensions by default. (you could relax the image policy if you want, or disable it but I'd advise you not to. The env var SYSTEMD_DISSECT_VERITY_SIGNATURE=0 tells sysext to not validate images) With upcoming systemd v255 just use "systemd-repart --make-ddi=sysext" to generate a sysext image with verity and signing. mkosi can help you too. 
You either need to install your signature public key in the kernel's own keychain somehow, or drop suitable certificates into {/etc,/run}/verity.d/*.crt. The latter is a bit underdocumented. (There was hope we could drop this again because it would become easier to install stuff into the kernel keychain, but that's still a mess, hence this userspace validation is probably going to stay for good). Ultimately if distros ship this in final products they really should use the kernel keyring for this. That's how MSFT uses this for example. > Supposing I manage to do all of the above, my next question would be > how/if to override the /lib folder instead of the traditional /usr or > /opt, as for example I might want to add another kernel module into > the UKI. /lib/ is 1990's Linux. On modern distros, such as Fedora it has long been replaced by a symlink to /usr/lib/. Hence if you want to drop stuff into /lib/ then just drop it into /usr/lib/ instead. > Last but not least is where is the documentation for this. I couldn't > find anything at all about systemd-sysext, and therefore I would be very > very happy to write (other than presenting it) some doc to make the life > easier to anyone like me that is looking forward to using these new > features. So there's the man page of systemd-sysext and systemd-repart. Flatcar has some docs: https://www.flatcar.org/docs/latest/provisioning/sysext/ There is a video from ASG how this fits together: https://www.youtube.com/watch?v=XTy3scX6rF4 There's no tutorial how to put this together though. Contributing that would be very welcome of course! Lennart -- Lennart Poettering, Berlin
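Separately from the signing question, the extension-release file written in the quoted steps (ID=fedora, VERSION_ID=40) is what gets compared against the host's os-release. The Python sketch below is a simplified reading of the rules in the systemd-sysext man page, not the actual implementation: the function names are made up, and real matching covers more fields than shown here.

```python
def parse_release(text: str) -> dict:
    """Parse os-release style KEY=VALUE lines into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip("\"'")
    return fields


def extension_matches(host: dict, ext: dict) -> bool:
    """Simplified check: ID must match, then SYSEXT_LEVEL or VERSION_ID."""
    ext_id = ext.get("ID")
    if ext_id == "_any":  # special value: applies to any host
        return True
    if ext_id is None or ext_id != host.get("ID"):
        return False
    if "SYSEXT_LEVEL" in ext:
        return ext["SYSEXT_LEVEL"] == host.get("SYSEXT_LEVEL")
    version = ext.get("VERSION_ID")
    return version is not None and version == host.get("VERSION_ID")


if __name__ == "__main__":
    host = parse_release('ID=fedora\nVERSION_ID=40\n')
    ext = parse_release("ID=fedora\nVERSION_ID=40\n")
    print(extension_matches(host, ext))
```

An extension built for VERSION_ID=39 would be skipped on this host, which is the behaviour that makes it safe to keep several sysext versions around during an upgrade.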
Re: [systemd-devel] Where to install UKI cmdline addons in the root partition
On Mo, 04.12.23 17:48, Emanuele Giuseppe Esposito (eespo...@redhat.com) wrote: > Hello everyone, > > Sorry for the back-to-back emails, but I realized I could use this > mailing list to bring up another topic related to UKI addons. > > This is the same as I wrote in > https://github.com/systemd/systemd/issues/29372 : I think we need some > agreement to decide that if distros want to ship rpms containing default > signed UKI addons, they should all go in the same place in the root > partition. > > By putting them there, we offer the user the possibility to keep the ESP > clean and lightweight (as there is not much space available in there > IIRC), and the user can simply cp the addons from the root partition > into the desired ESP to install the addon, and rm to remove them. > > But I still think it is important to have some agreement, and document > it somewhere. > > What do you think? I commented on the github issue. At this time I think more people are subscribed to that than watch this ML. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Configure netdev RPS using systemd-networkd
On Mo, 04.12.23 14:59, Renjaya Raga Zenta (ragaze...@gmail.com) wrote: > Hi, > > We want to implement our networking using systemd-networkd. We think > systemd is stable enough right now, so we want to try more "systemd-only" > solution. > > In our environment, we use RPS (Receive Packet Steering) for load balancing > and scaling. It's a kernel feature implemented a long time ago. You could > visit the documentation at > https://www.kernel.org/doc/html/latest/networking/scaling.html. > > Currently, we manually do this after network interface is configured: > > echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus > > where f is bitmap mask , it means to utilize 4 cpus. > > Will this use case be implemented in systemd-networkd? Or should we use a > third party solution such as networkd-broker or networkd-dispatcher? I see no reason why we wouldn't add a high-level option for this to .link files. We are happy to review/merge a patch. Please submit via GitHub. Lennart -- Lennart Poettering, Berlin
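Until such an option exists, the value written to rps_cpus has to be computed by hand. A small Python sketch of that computation follows; the function names are made up, and the comma grouping into 32-bit chunks for larger CPU numbers reflects my understanding of how the kernel parses such bitmaps, so treat it as an assumption.

```python
def rps_cpu_mask(cpus) -> str:
    """Build the hex bitmap for rps_cpus from an iterable of CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers are non-negative")
        mask |= 1 << cpu
    # Split into 32-bit chunks, least significant first.
    chunks = []
    while True:
        chunks.append(mask & 0xFFFFFFFF)
        mask >>= 32
        if mask == 0:
            break
    chunks.reverse()
    return ",".join([f"{chunks[0]:x}"] + [f"{c:08x}" for c in chunks[1:]])


def rps_cpus_path(iface: str, queue: int) -> str:
    """Path of the sysfs attribute used in the quoted echo command."""
    return f"/sys/class/net/{iface}/queues/rx-{queue}/rps_cpus"


if __name__ == "__main__":
    # CPUs 0-3 give the "f" from the quoted command.
    print(rps_cpu_mask(range(4)), ">", rps_cpus_path("eth0", 0))
```

Actually writing the value requires root and an existing receive queue, so the sketch only prints what would be written.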
Re: [systemd-devel] systemd: questions about dbus dependency service
On Mo, 04.12.23 13:01, Pintu Agarwal (pintu.p...@gmail.com) wrote: > Hi, > Any comments or suggestions on the below ? I already replied. https://lists.freedesktop.org/archives/systemd-devel/2023-November/049706.html Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Systemd-nspawn single process
On Fr, 01.12.23 14:03, Warex61 YTB (thomasdabou...@gmail.com) wrote: > Hello, > I would like to use systemd-nspawn to create a container that can launch a > single process as pid 1 and mount its configuration files. I want the > container to be as light as possible. Is there any way of creating a > container using nspawn without using bootstrap ? > > For example, using this command, without using a bootstrap > > systemd-nspawn -M process -D /etc/systemd/nspawn/process > /etc/systemd/nspawn/process.nspawn > I get the following error > > Directory /etc/systemd/nspawn/process doesn't look like it has an OS tree. > Refusing. > What are the conditions for nspawn to consider an OS tree in > /etc/systemd/nspawn/process ? You are using an ancient version of nspawn. Since 2y or so the message reads: Directory %s doesn't look like it has an OS tree (/usr/ directory is missing). Refusing. And that's your explanation: you need an /usr/ directory. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to properly wait for udev?
On Mo, 27.11.23 21:32, Richard Weinberger (richard.weinber...@gmail.com) wrote: > On Mon, Nov 27, 2023 at 9:29 AM Lennart Poettering > wrote: > > If they conceptually should be considered block device equivalents, we > > might want to extend the udev logic to such UBI devices too. Patches > > welcome. > > Why doesn't udev flock() every device it is probing? > Or asked differently, why is this feature opt-in instead of opt-out? Some software really doesn't like it if we take BSD locks on their devices, hence we don't take it blanket everywhere. And what's more important even: for various devices it simply isn't safe to just willy-nilly even open them (tape drives and things, which might start to pull in a tape if we do). For others we might not be able to even open things at all with the wrong flags (for example, because they are output only). Block devices have relatively well defined semantics, there it's generally safe to do this, hence we do. Hence, it might be safe for UBI, but for the general case it might not be. That said, would BSD locking even address your issue? If your devices are exclusive access things and we first open() them and then flock() them, then that's not atomic. So if your test cases open the devices, then flock() them you might still get into conflict with udev because it just open()ed the device, but didn't get to call flock() yet. Doesn't UBI have something like O_EXCL-behaviour that grants true exclusive access? Lennart -- Lennart Poettering, Berlin
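The open()-then-flock() sequence discussed above, sketched in Python on a regular file. The function name is made up, and a temporary file stands in for a device node; whether UBI devices behave the same way is exactly the open question here. The comment marks the window that makes the sequence non-atomic.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path: str):
    """Open path and try to take an exclusive BSD lock without blocking.

    Returns the locked fd, or None if somebody else holds the lock.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    # Window: the node is already open here, but not yet locked.
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        first = try_exclusive_lock(tmp.name)
        second = try_exclusive_lock(tmp.name)  # separate open(), so it conflicts
        print("first locked:", first is not None, "second locked:", second is not None)
        os.close(first)
```

For a device that refuses a second open() outright, the second caller would already fail in os.open(), before any lock is consulted, which is why BSD locking alone may not help.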
Re: [systemd-devel] systemd: questions about dbus dependency service
On Di, 28.11.23 22:48, Pintu Agarwal (pintu.p...@gmail.com) wrote: > Hi, > > I need some clarification about systemd services that are dependent on dbus > service. > > We have a service that depends on dbus.service, so our service has to be > started after dbus.socket and dbus.service. It's usually a good idea to not wait for dbus.service. Waiting for dbus.socket is sufficient, it makes sure clients can connect to D-Bus (even if dbus needs to finish starting up to respond to it). This will increase parallelization during boot. > But dbus.service comes after local-fs.target and sysinit.target. > However, our service needs to be started very early on boot-up, maybe > within local-fs target itself, otherwise it is causing regression in our > boot KPI. dbus is not a suitable IPC for early boot services, unless you speak the dbus protocol directly between client and service, without involving the broker. But that's messy. systemd's PID 1 does this (i.e. dbus without a broker), because it must be accessible early on, but I hate that code, and I'd rather kill it. In new code that must run in early boot we usually use a different IPC (varlink), that does not involve any broker, and thus always works. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] How to properly wait for udev?
On So, 26.11.23 00:39, Richard Weinberger (richard.weinber...@gmail.com) wrote: > Hello! > > After upgrading my main test worker to a recent distribution, the UBI > test suite [0] fails at various places with -EBUSY. > The reason is that these tests create and remove UBI volumes rapidly. > A typical test sequence is as follows: > 1. creation of /dev/ubi0_0 > 2. some exclusive operation, such as atomic update or volume resize on > /dev/ubi0_0 > 3. removal of /dev/ubi0_0 > > Both steps 2 and 3 can fail with -EBUSY because the udev worker still > holds a file descriptor to /dev/ubi0_0. Hmm, I have no experience with UBI, but are you sure we open that? why would we? are such devices analyzed by blkid? We generally don't open device nodes unless we have a reason to, such as doing blkid on it or so. What precisely fails for you? the open()? or some operation on the opened fd? > > FWIW, the problem can also get triggered using UBI's shell utilities > if the system is fast enough, e.g. > # ubimkvol -N testv -S 50 -n 0 /dev/ubi0 && ubirmvol -n 0 /dev/ubi0 > Volume ID 0, size 50 LEBs (793600 bytes, 775.0 KiB), LEB size 15872 > bytes (15.5 KiB), dynamic, name "testv", alignment 1 > ubirmvol: error!: cannot UBI remove volume > error 16 (Device or resource busy) > > Instead of adding a retry loop around -EBUSY, I believe the best > solution is to add code to wait for udev. > For example, having a udev barrier in ubi_mkvol() and ubi_rmvol() [1] > seems like a good idea to me. For block devices we implement this: https://systemd.io/BLOCK_DEVICE_LOCKING I understand UBI aren't block devices though? If they conceptually should be considered block device equivalents, we might want to extend the udev logic to such UBI devices too. Patches welcome. We provide "udevadm lock" to lock a block device according to this scheme from shell scripts. > What function from libsystemd do you suggest for waiting until udev is > done with rule processing? 
> My naive approach, using udev_queue_is_empty() and > sd_device_get_is_initialized(), does not resolve all failures so far. > Firstly, udev_queue_is_empty() doesn't seem to be exported by > libsystemd. I have open-coded it as: > static int udev_queue_is_empty(void) { >return access("/run/udev/queue", F_OK) < 0 ? >(errno == ENOENT ? true : -errno) : false; > } This doesn't really work. udev might still process the device in the background. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Does coredumpctl info support minidebuginfo / gnu_debugdata ?
On Do, 16.11.23 18:37, Etienne Cordonnier (ecordonn...@snap.com) wrote: > Hello, > I am testing a yocto based system, where it seems that "coredumpctl info" > isn't able to use minidebuginfo / gnu_debugdata to extract a symbolized > call-stack. I saw in the code that coredumpctl uses elfutils / libdwfl in > order to extract a call-stack, and as far as I understand libdwfl supports > minidebuginfo since this commit ( > https://sourceware.org/git/?p=elfutils.git;a=commit;h=5083a70d3b64946fa47ea5766943a15a3ecc6891 > ). > > Is there a configuration / build-option / etc. to enable support for > minidebuginfo in coredumpctl? If no is it on the roadmap? The advantage of > minidebuginfo is that it is much smaller than full debug symbols. Fedora has been using minidebuginfo since ~10y or so, and coredumpctl/libdwfl has been working fine with it. So it certainly works, it's how this all works on my local machine since forever. Maybe ask your distro for help, it's generally an integration issue of distributions if this doesn't work. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Low memory dbus signal for GMemoryMonitor
On Di, 14.11.23 15:00, Kate Hsuan (h...@redhat.com) wrote: > Hi Folks, Hi! > Could systemd detect the system's low memory status and send a signal > through Dbus about low memory events? We already have an interface for this, it's documented here: https://systemd.io/MEMORY_PRESSURE It doesn't operate via D-Bus however, but instead just tells apps how to directly get the events from the kernel. That's generally better than bumping the events off two daemons (i.e. a memory pressure daemon and a dbus broker), simply because memory pressure is a problem of latency, and you should not add additional steps to the notifications if you want to make things better and not worse. Moreover, on memory pressure you shouldn't allocate more memory, which is something the indirection through a daemon and broker would typically mean. > We are looking for a new backend for GMemoryMonitor. > https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html > > The original backend- low-memory-monitor monitors the system memory > usage. When it detects the memory is lower than a level, it signals > the application. It also manages the kernel OOM. It should be possible to implement a GMemoryMonitor on top of the kernel APIs directly, using the information systemd gives you. See the documentation. It even briefly mentions GMemoryMonitor at the end. If you have any questions about details, feel free to ask! Lennart -- Lennart Poettering, Berlin
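To give an idea of what "the kernel APIs" look like, here is a Python sketch that parses the contents of the kernel's PSI file, /proc/pressure/memory. The function name is made up, and this only reads a snapshot; the protocol in the linked document is event-based (the application registers a threshold and waits for notifications), which this sketch does not implement.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        values = {}
        for field in fields:
            key, _, value = field.partition("=")
            # total is a counter in microseconds, the avg* fields are percentages
            values[key] = int(value) if key == "total" else float(value)
        result[kind] = values
    return result


if __name__ == "__main__":
    try:
        with open("/proc/pressure/memory") as f:
            print(parse_pressure(f.read()))
    except OSError:
        print("PSI not available on this kernel")
```

A GMemoryMonitor backend would map pressure thresholds to its warning levels instead of polling snapshots like this.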
Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)
On Mo, 30.10.23 10:17, Lennart Poettering (lenn...@poettering.net) wrote: > On Fr, 27.10.23 20:46, Tony Rodriguez (unixpro1...@gmail.com) wrote: > > > Andrea asked for more details so I have provide this verbose output. > > > > 1) Lennart's recommendation of removing "/tmp" within /etc/fstab and using > > tmpfs for "/tmp" appears to stop the dependency issue for systemd-239 for > > systemd-252. However, RH8 and RH9 don't support systemd-networkd, I am > > wondering how this can be overcome if removing "/tmp" and using "tmpfs" > > aren't options? Would I have to modify various services and targets? What > > would I need to add or remove within services and targets to avoid these > > dependencies? > > This is something you'd have to ask your OS vendor. If they don't > support netwokrd, they will support something else, and maybe it has a > way to run in early boot or initrd. > > Booting without /tmp/ mounted during early boot is simply not > supported from upstream, sorry, can't help you there. Please contact > your OS vendor if they can help you. > > > 2) On another note, with RH9 systemd-252-14/systemd-252-18 and iscsi, new > > dependency issues occur if "_netdev" within /etc/fstab is specified for > > "/var" or "/usr". > > If /usr/ is split off it *must* be mounted even earlier than /tmp/: it > must be mounted in the initrd, nothing else is supported, sorry. > > If /var/ is split off it must be mounted at the same point as /tmp/, > i.e some time in early boot, not necessarily in the initrd though. Since we never documented this properly I have put together another piece of documentation that summarizes the requirements on mounts and when they must be available during boot: https://github.com/systemd/systemd/pull/29761 You can see the rendered version here (until the next PR push that is) https://github.com/systemd/systemd/blob/87828aae4712bdb300101b05911392c43d081a6b/docs/MOUNT_REQUIREMENTS.md Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)
On Fr, 27.10.23 20:46, Tony Rodriguez (unixpro1...@gmail.com) wrote: > Andrea asked for more details so I have provide this verbose output. > > 1) Lennart's recommendation of removing "/tmp" within /etc/fstab and using > tmpfs for "/tmp" appears to stop the dependency issue for systemd-239 for > systemd-252. However, RH8 and RH9 don't support systemd-networkd, I am > wondering how this can be overcome if removing "/tmp" and using "tmpfs" > aren't options? Would I have to modify various services and targets? What > would I need to add or remove within services and targets to avoid these > dependencies? This is something you'd have to ask your OS vendor. If they don't support networkd, they will support something else, and maybe it has a way to run in early boot or initrd. Booting without /tmp/ mounted during early boot is simply not supported from upstream, sorry, can't help you there. Please contact your OS vendor if they can help you. > 2) On another note, with RH9 systemd-252-14/systemd-252-18 and iscsi, new > dependency issues occur if "_netdev" within /etc/fstab is specified for > "/var" or "/usr". If /usr/ is split off it *must* be mounted even earlier than /tmp/: it must be mounted in the initrd, nothing else is supported, sorry. If /var/ is split off it must be mounted at the same point as /tmp/, i.e. some time in early boot, not necessarily in the initrd though. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Help! iSCSI based file systems with "_netdev" causing ordering cycles to occur (random services and mounts fail)
On Do, 26.10.23 19:03, Tony Rodriguez (unixpro1...@gmail.com) wrote: > Experiencing this same issue with iSCSI and systemd-239 for RH8/Rocky8 and > RH9/Rocky9 system-252. Nothing was done on my end to create this issue. In > other words, no custom mount/unit files or services, just your typical ISO > install and rpm updates. > > An ordering cycle occurs, when "_netdev" is specified within /etc/fstab for > systemd. This happens with systemd-239-14 and systemd-239-18 using iSCSI > based file systems. Seems others are experiencing this as well (see link > below). I can also confirm this happens with systemd-252 (RH9/Rocky9)l. > Especially if "_netdev" is used with either "/var" or "/usr" iSCSI based > devices/file systems. The system may not boot, may not mount file systems, > may not start services/unit files, and the system becomes slow during system > boot. > > Does anyone know of a fix/patch and root cause for this? > > Please see this link: > https://issues.redhat.com/browse/RHEL-12987?jql=project%20%3D%20RHEL%20AND%20affectedVersion%20%3D%20rhel-9.2.0%20AND%20text%20~%20%22iscsi%22 > > # cat /etc/fstab > [...] 
> /dev/mapper/rhel-root / xfs defaults,_netdev 0 0 > UUID=2177a7fc-bc41-43e4-bdc1-d231a5eb4680 /boot xfs defaults,_netdev 0 0 > /dev/mapper/rhel-tmp /tmp xfs defaults,_netdev 0 0 > /dev/mapper/rhel-var /var xfs > defaults,_netdev,x-initrd.mount 0 0 > /dev/mapper/rhel-var_log /var/log xfs defaults,_netdev 0 0 > /dev/mapper/rhel-var_tmp /var/tmp xfs defaults,_netdev 0 0 > > # journalctl -b | grep deleted > Oct 13 08:15:35 vm-isci8 systemd[1]: basic.target: Job tmp.mount/start > deleted to break ordering cycle starting with basic.target/start > Oct 13 08:15:35 vm-isci8 systemd[1]: network.target: Job > network-pre.target/start deleted to break ordering cycle starting with > network.target/start > Oct 13 08:15:35 vm-isci8 systemd[1]: NetworkManager.service: Job > dbus.socket/start deleted to break ordering cycle starting with > NetworkManager.service/start /tmp must be available during early boot already, and your NetworkManager service is apparently a late boot service. Hence you have a cycle: you want that /tmp/ is mounted after the network, but your network is configured really late. But /tmp is necessary during early boot. BOOM! Two ways out: 1. Don't make /tmp an iscsi mount. Bad idea anyway. Just use tmpfs for it, like everyone else. 2. Upgrade to a better network management solution that has no problems with running in early boot, for example systemd-networkd. Lennart -- Lennart Poettering, Berlin
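To spot this kind of setup before it bites, a small check along these lines may help. It is only a sketch: the function name early_netdev_mounts is made up, the list of "early" mount points (/usr, /var, /tmp) is taken from this thread, and a hit merely means "worth a second look", not that a cycle is guaranteed.

```sh
#!/bin/sh
# Hypothetical helper: list early-boot mount points that carry _netdev
# in the fstab-style file given as $1.
early_netdev_mounts() {
    awk '$1 !~ /^#/ && ($2 == "/usr" || $2 == "/var" || $2 == "/tmp") &&
         $4 ~ /(^|,)_netdev(,|$)/ { print $2 }' "$1"
}

# Try it on a scratch copy modelled on the fstab quoted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/mapper/rhel-root    /        xfs defaults,_netdev 0 0
/dev/mapper/rhel-tmp     /tmp     xfs defaults,_netdev 0 0
/dev/mapper/rhel-var     /var     xfs defaults,_netdev,x-initrd.mount 0 0
/dev/mapper/rhel-var_log /var/log xfs defaults,_netdev 0 0
EOF
flagged=$(early_netdev_mounts "$sample")
printf '%s\n' "$flagged"
rm -f "$sample"
```

On a real system the same function could be pointed at /etc/fstab.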
Re: [systemd-devel] How to use systemd-growfs* services with GPT automount
On Di, 24.10.23 23:48, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > On Mo, 23.10.23 02:00, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > > > > Hello, > > > > > > I am not sure how to get systemd-growfs-root.service to work with > > > automount. The partitions are configured via systemd-repart (and the > > > image created using mkosi). While the partitions are correctly grown > > > upon boot, the contained filesystem is not grown to match the > > > partition even though GrowFileSystem defaults to true. Is there > > > anything I am missing or an easy way to troubleshoot this and get more > > > information? > > > > > > One thing I notice is that the generator.late/-.mount unit has a > > > Options=ro which as per documentation prevents growing the filesystem. > > > However, the filesystem is actually mounted read-write so I assume > > > this is just an artifact of the initrd. Is it not possible to grow the > > > filesystem from which the initrd starts? > > > > Do you have "ro" or "rw" on the kernel cmdline? > > I have neither set on the cmdline. if you add it, does it work? ro/rw is a bit weird. Usually in our configuration model the settings on the kernel cmdline args take precedence over config in /etc/. But ro/rw is different for historical reasons: it only specifies the initial ro/rw state of the disks, expecting that /etc/fstab later changes things to the final setting. And if neither are specified we imply "ro". Hence, you have two choices: define an /etc/fstab (which of course is not what you want with gpt-auto) or just add "rw" to the kernel cmdline. Lennart -- Lennart Poettering, Berlin
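The rule described above (neither "ro" nor "rw" on the cmdline implies "ro") can be sketched as a small check. The function name root_mode is made up, and the "last one wins if both appear" behaviour is an assumption of this example, not something stated in the thread.

```sh
#!/bin/sh
# Hypothetical helper: given a kernel command line as $1, print the
# initial root mode. Default is "ro" when neither word is present.
root_mode() {
    mode=ro
    set -f                      # no globbing while splitting the words
    for word in $1; do
        case $word in
            ro|rw) mode=$word ;;   # assumption: a later word overrides
        esac
    done
    set +f
    printf '%s\n' "$mode"
}

# Inspect the running system if /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    root_mode "$(cat /proc/cmdline)"
fi
```

If this prints "ro" on a gpt-auto system without /etc/fstab, adding "rw" to the kernel cmdline is the option suggested above.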
Re: [systemd-devel] How to get Credential into Environment variable?
On Di, 26.09.23 04:39, chandler (s...@riseup.net) wrote: > Hi all, > > I'm not quite grasping something here... I've just learned about > `systemd-creds` and now trying to utilize it with a service which > depends on a secret stored in an environment variable (or passed as a > CLI option). > > Normally I could use a line like: > > `Environment=SEC=1234` > > Now I've: > > 1) Given "1234" to `systemd-ask-password -n | systemd-creds encrypt > --name=secret --pretty - -` > 2) Put the resulting `SetCredentialEncrypted=secret: ...` under the > [Service] section > 3) Failing with `Environment=SEC=%d/secret` > > Now `SEC=/run/credentials/myService.service/secret` but I need the value > from the file, which I verified with a simple `ExecStart=checkEnv.sh` > which runs `cat ${SEC}` which prints `1234`. > > Also tried putting the secret, e.g. "1234", into a temp file `/tmp/sec` > and ran: > > `systemd-creds encrypt --name=secret --pretty /tmp/sec -` > > but the results are the same. > > How to get `SEC=1234` basically? The credentials logic is supposed to be used *in* *place* of environment variables. Environment variables are an awful way to pass credentials to services, since they are inherited down the process tree even across security boundaries by default, and there's zero access control on them. (And they are not really compatible with binary data or larger data, and so on.) Hence, what you are looking for is not supported and we won't support it, because it defeats one main design goal of credentials: to require access control on access, and not allow "greedy" inheritance down the process tree. Sorry if that's disappointing! If a service insists on reading its credentials from an env var or cmdline and supports nothing else this is of course disappointing, but it's simply not compatible with the credentials logic, without manual glue scripting. 
We generally recommend that services support reading the credentials from files (in which case you can point them to $CREDENTIALS_DIRECTORY/ to read what they want), or even better: just natively support credentials by looking at $CREDENTIALS_DIRECTORY on their own, without being told so. If you have an app that doesn't allow either, and really and only wants env vars or cmdline params, then you can script around this, with a script like this:

```bash
#!/bin/bash
# Read the credential named "mycred" from the directory systemd provides.
read -r MYCRED < "$CREDENTIALS_DIRECTORY"/mycred
# Hand it to the real program (and only to it) via the environment.
export MYCRED
exec mybinary
```

you get the idea. Lennart -- Lennart Poettering, Berlin
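The glue-script idea can be tried without systemd by pointing CREDENTIALS_DIRECTORY at a scratch directory. This is just a sketch: the credential name "secret" and the variable SEC are taken from the question, load_cred is a made-up helper, and the final "sh -c" stands in for the real "exec mybinary".

```sh
#!/bin/sh
# Simulate what systemd provides: a directory with one file per credential.
CREDENTIALS_DIRECTORY=$(mktemp -d)
export CREDENTIALS_DIRECTORY
printf '1234\n' > "$CREDENTIALS_DIRECTORY/secret"

# Hypothetical helper: print the first line of credential $1, fail if absent.
load_cred() {
    [ -r "$CREDENTIALS_DIRECTORY/$1" ] || return 1
    IFS= read -r value < "$CREDENTIALS_DIRECTORY/$1"
    printf '%s' "$value"
}

SEC=$(load_cred secret) || { echo "credential 'secret' missing" >&2; exit 1; }

# A real wrapper would now do: export SEC; exec mybinary
# Here a child shell stands in for the service and reports what it saw.
result=$(SEC="$SEC" sh -c 'printf %s "$SEC"')
echo "child saw SEC=$result"

rm -rf "$CREDENTIALS_DIRECTORY"
```

Note that this only reads one line, so it suits short textual secrets; binary or multi-line credentials are better read from the file by the service itself.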
Re: [systemd-devel] How to use systemd-growfs* services with GPT automount
On Mo, 23.10.23 02:00, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hello, > > I am not sure how to get systemd-growfs-root.service to work with > automount. The partitions are configured via systemd-repart (and the > image created using mkosi). While the partitions are correctly grown > upon boot, the contained filesystem is not grown to match the > partition even though GrowFileSystem defaults to true. Is there > anything I am missing or an easy way to troubleshoot this and get more > information? > > One thing I notice is that the generator.late/-.mount unit has a > Options=ro which as per documentation prevents growing the filesystem. > However, the filesystem is actually mounted read-write so I assume > this is just an artifact of the initrd. Is it not possible to grow the > filesystem from which the initrd starts? Do you have "ro" or "rw" on the kernel cmdline? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Support for loading Multiple DTBs from UKI image
On Mi, 11.10.23 10:00, VENKAT Nagaraj THOGARU (QUIC) (quic_thog...@quicinc.com) wrote: > Hi > @systemd-devel@lists.freedesktop.org<mailto:systemd-devel@lists.freedesktop.org>, > > > Is there any support for parsing Multiple DTBs and selecting appropriate DTB > from UKI image in system-boot? > Or is there any UEFI interface hook to implement such a change in UEFI to > make a selection of DTB, just like DT_FIXUP ? There's a PR for this: https://github.com/systemd/systemd/pull/28959 But it hasn't seen progress in the past 3 weeks. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Help! Reached target Local File Systems order is incorrect
On Mo, 09.10.23 12:07, Tony Rodriguez (unixpro1...@gmail.com) wrote: > Created a service that invokes a "systemctl daemon-reload". Goal is for a > reload to occur early in the boot process, before other user made services > are invoked. During additional testing, sometimes it is correct and other > times it is out of order (incorrect - See steps C). It may work for 5 or 6 > times after each reboot/shutdown, then randomly become incorrect. How can I > make this more consistent? Already tried various keyword combinations > (wants,before,after, etc) within the unit file, all without luck. > Thought about something like "After=var.mount, etc" as well, but that is > inflexible because I will not know filesystems users may create. > > A) Unit file > > [Unit] > Description=Systemctl-Reload > Wants=local-fs.target > DefaultDependencies=yes > > [Service] > Type=oneshot > RemainAfterExit=yes > ExecStart=/bin/systemctl daemon-reload > > [Install] > WantedBy=local-fs.target > > B) Correct order: ** Reached target Local File Systems is after all > mounting is done. Sometimes it works. You have not defined any order in the unit file. i.e. not After= nor Before=. Hence it's going to be executed as quickly as possible during boot. See docs: https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Before= Generally though it's recommended not to reload PID 1 configuration during the initial transaction if avoidable. Better approaches are to put together generators or so, which can augment the set of units and their dependencies already when the first transaction is put together. https://www.freedesktop.org/software/systemd/man/systemd.generator.html Lennart -- Lennart Poettering, Berlin
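To give an idea of the generator approach: a generator is an executable that systemd runs before the first transaction is built, passing it three output directories (normal, early, late) as arguments; units written there become part of the unit set. The sketch below is hypothetical throughout: the unit name, its contents and the install location mentioned in the comment are made up, and it is run against a scratch directory rather than by systemd.

```sh
#!/bin/sh
# Hypothetical generator body. Installed as an executable under e.g.
# /etc/systemd/system-generators/ it would be called by systemd as:
#   <generator> NORMAL_DIR EARLY_DIR LATE_DIR
generate() {
    normal_dir=$1
    unit=example-extra.service          # made-up unit name

    # Emit a unit with explicit ordering instead of reloading PID 1 later.
    cat > "$normal_dir/$unit" <<'EOF'
[Unit]
Description=Example unit emitted by a generator
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/true
EOF

    # Pull it in from multi-user.target, like "WantedBy=" would after enable.
    mkdir -p "$normal_dir/multi-user.target.wants"
    ln -sf "../$unit" "$normal_dir/multi-user.target.wants/$unit"
}

# Dry run against a scratch directory to see what would be generated.
out=$(mktemp -d)
generate "$out" "$out/early" "$out/late"
ls -R "$out"
```

See systemd.generator(7) for the constraints generators run under (very early boot, no logging to the journal yet, and so on).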
Re: [systemd-devel] Systemd cgroup setup issue in containers
On Fr, 29.09.23 10:53, Lewis Gaul (lewis.g...@gmail.com) wrote: > Hi systemd team, > > I've encountered an issue when running systemd inside a container using > cgroups v2, where if a container exec process is created at the wrong > moment during early startup then systemd will fail to move all processes > into a child cgroup, and therefore fail to enable controllers due to the > "no internal processes" rule introduced in cgroups v2. In other words, a > systemd container is started and very soon after a process is created via > e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the > container's namespaces (although not a child of the container's PID > 1). Yeah, joining into a container is really weird, it makes a process appear from nowhere, possibly blocking resources, outside of the resource or lifecycle control of the code in the container, outside of any security restrictions and so on. I personally think joining a container via joining the namespaces (i.e. podman exec) might be OK for debugging, but it's not a good default workflow. Unfortunately the problems with the approach are not well understood by the container people. In systemd's own container logic (i.e. systemd-nspawn + machinectl) we hence avoid doing anything like this. "machinectl shell" and related commands will instead talk to PID 1 in the container and ask it to spawn something off, rather than doing so yourself. Kinda related to this: util-linux' "unshare" tool (which can be used to generically enter a container like this) also is pretty broken in this regard btw, and I asked them to fix that, but nothing happened there yet: https://github.com/util-linux/util-linux/issues/2006 I'd advise "podman" and these things to never place joined processes in the root cgroup of the container if they delegate cgroup access to the container, because that really defeats the point. 
Instead they should always join the cgroup of PID 1 in the container (which they might already do I think), and if PID 1 is in the root cgroup, then they should create their own subcgroup "/joined" or so, and put the process in there, to not collide with the "no processes in inner groups" rule of cgroupv2. > This is not a totally crazy thing to be doing - this was hit when testing a > systemd container, using a container exec "probe" to check when the > container is ready. > > More precisely, the problem manifests as follows (in > https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676 > ): > - Container exec processes are placed in the container's root cgroup by > default, but if this fails (due to the "no internal processes" rule) then > container PID 1's cgroup is used (see > https://github.com/opencontainers/runc/issues/2356). This is a really bad idea. At the very least the rule should be reversed (which would still be racy, but certainly better). But as mentioned they should never put something in the root cgroup if cgroup delegation is on. > - At systemd startup, systemd tries to create the init.scope cgroup and > move all processes into it. > - If a container exec process is created after finding procs to move and > moving them but before enabling controllers then the exec process will be > placed in the root cgroup. > - When systemd then tries to enable controllers via subtree_control in the > container's root cgroup, this fails because the exec process is in that > cgroup. > > The root of the problem here is that moving processes out of a cgroup and > enabling controllers (such that new processes cannot be created there) is > not an atomic operation, meaning there's a window where a new process can > get in the way. One possible solution/workaround in systemd would be to > retry under this condition. Or perhaps this should be considered a bug in > the container runtimes? Yes, that's what I think. 
They should fix that. Lennart -- Lennart Poettering, Berlin
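The placement rule suggested above (join PID 1's cgroup, and use a separate subgroup if PID 1 sits in the root cgroup) can be written down as a tiny decision function. This is a sketch of the policy only; join_target is a made-up name, "/joined" is the example name from the reply, and nothing here actually moves a process.

```sh
#!/bin/sh
# Hypothetical helper: given the cgroup path of the container's PID 1
# (relative to the container's cgroup root), print where a joined
# process should be placed.
join_target() {
    case $1 in
        /) printf '/joined\n' ;;     # never put it into the root cgroup
        *) printf '%s\n' "$1" ;;     # otherwise share PID 1's cgroup
    esac
}

# On cgroup v2, /proc/1/cgroup contains a single "0::<path>" line.
pid1_cgroup=$(sed -n 's/^0:://p' /proc/1/cgroup 2>/dev/null)
if [ -n "$pid1_cgroup" ]; then
    echo "would join: $(join_target "$pid1_cgroup")"
fi
```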
Re: [systemd-devel] Is systemd-cryptsetup binary internal?
On Mo, 18.09.23 17:47, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hi, > > /usr/lib/systemd/ is indeed the place for internal binaries with > > unstable interfaces. But it's also the place where we put binaries > > that we don't typically expect users to call, because they are > > generally called via some well define .service unit or so only. > > > > systemd-cryptsetup is one of the latter, we'd expect people to use > > this via crypttab mostly. However, the interface is nonetheless > > stable, it is a long-time part of systemd and so far we never broke > > interface and I see no reason we ever would. In fact it might be a > > candidate to move over to /usr/bin to make official, if there's > > sufficient request for that. (such a request should be made via github > > issue tracker) > > > > Why was the decision taken to put these into /usr/lib/systemd instead of > /usr/libexec/systemd/? That's a Fedoraism. Why would one put something there? /usr/lib/ is where private arch-dependent package stuff goes. What's the rationale for /usr/libexec/ though? Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Is systemd-cryptsetup binary internal?
On Mo, 18.09.23 15:22, mpan (systemdml-bfok4...@mpan.pl) wrote: > Hello, > > I got redirected to here from #systemd on Libera. While responding to a > query from another person (not on #systemd), I came across an ambiguity. Any > answer I give, its validity would be uncertain. I wish to receive an > authoritative clarification. > > There is systemd-cryptsetup binary in “/usr/lib/systemd/”. Its location > suggests it’s internal to systemd and not intended for user invocation. > However, it is also listed in manual as if it was something the user might > be concerned with. The manual even has a specific, separate, explicit > reference to systemd-cryptsetup page — though it’s shared with the > corresponding service and the binary itself isn’t described. /usr/lib/systemd/ is indeed the place for internal binaries with unstable interfaces. But it's also the place where we put binaries that we don't typically expect users to call, because they are generally called via some well define .service unit or so only. systemd-cryptsetup is one of the latter, we'd expect people to use this via crypttab mostly. However, the interface is nonetheless stable, it is a long-time part of systemd and so far we never broke interface and I see no reason we ever would. In fact it might be a candidate to move over to /usr/bin to make official, if there's sufficient request for that. (such a request should be made via github issue tracker) > Thanks in advance for indicating, if systemd-cryptsetup (the binary) is a > tool users may rely on. Yes, absolutely. The only reason when we might break things for you is when we one day move it from /usr/lib to /usr/bin, ;-) Hence: the call interface is certainly stable, the location in that sense maybe not yet. Lennart -- Lennart Poettering, Berlin
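Since the call interface is stable but the location might move one day, a script that invokes the binary could look it up in both places instead of hard-coding one path. A minimal sketch; find_helper is a made-up name, and the demo uses a scratch directory with a dummy file so that it runs anywhere.

```sh
#!/bin/sh
# Hypothetical helper: print the first executable called $1 found in the
# directories given as the remaining arguments.
find_helper() {
    name=$1
    shift
    for d in "$@"; do
        if [ -x "$d/$name" ]; then
            printf '%s\n' "$d/$name"
            return 0
        fi
    done
    return 1
}

# Demo with a scratch directory standing in for /usr/lib/systemd.
demo=$(mktemp -d)
mkdir -p "$demo/bin" "$demo/lib"
printf '#!/bin/sh\n' > "$demo/lib/systemd-cryptsetup"
chmod +x "$demo/lib/systemd-cryptsetup"
found=$(find_helper systemd-cryptsetup "$demo/bin" "$demo/lib")
echo "$found"
```

On a real system the call would be something like: find_helper systemd-cryptsetup /usr/bin /usr/lib/systemd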
Re: [systemd-devel] DynamicUser=yes leads to "Too many levels of symbolic links" for /etc/.pwd.lock
On Do, 14.09.23 03:50, Muggeridge, Matt (matt.muggerid...@hpe.com) wrote: > $ ls -l /etc/.pwd.lock > > lrwxrwxrwx 1 root root 19 Apr 5 2011 /etc/.pwd.lock -> sysconfig/.pwd.lock > > $ ls -l /etc/sysconfig/.pwd.lock > > -rw--- 1 root root 0 Aug 16 07:25 /etc/sysconfig/.pwd.lock > > For the purpose of investigation, I configured an overlay so /etc/.pwd.lock > was a simple writeable file (not a read-only symlink) and the service starts. > > Why is systemd complaining about the file being a symlink? It's supposed to be a lock file, i.e. a regular file we issue POSIX file locks on. It's not a config file. The problem with symlinks for things like this is that in various contexts these files are atomically replaced, and if that happens then symlinks just make a mess, since it's not clear whether to replace the symlink or its target. Hence, we don't support that. Generally, things like /etc/passwd are pretty much API, you cannot really change them to be symlinks (unless you make them fully immutable), since they are updated by various tools and these tools tend to do atomic updates of these files, i.e. when updating they write a new file under a temporary name/O_TMPFILE, and then atomically move it over the old file, so that other clients either get the old version or the new version but never a half-updated version. This kind of updating is really how you have to do things on UNIX, but that means symlinks are out of the question... Hence, TLDR: don't make the lock file a symlink. (Also, why would you even?) Lennart -- Lennart Poettering, Berlin
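The "write a temporary file, then move it over the old one" pattern, and what it does to a symlink, can be seen in a scratch directory. This is just a demonstration with made-up file names: after the update the symlink is gone, replaced by a regular file, and the old target still holds the stale content.

```sh
#!/bin/sh
dir=$(mktemp -d)
printf 'old\n' > "$dir/target"
ln -s target "$dir/link"            # "link" plays the role of the symlinked file

# Atomic update of the path "link": new content goes to a temp file in the
# same directory, then rename() (via mv) swaps it in as a single step.
tmp=$(mktemp "$dir/.tmp.XXXXXX")
printf 'new\n' > "$tmp"
mv -f "$tmp" "$dir/link"

ls -l "$dir"
echo "link:   $(cat "$dir/link")"
echo "target: $(cat "$dir/target")"
```

Readers of "link" see either the old or the new content, never a mix, but whoever expected "target" to be updated is left with the old data.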
Re: [systemd-devel] Fedora 38 and signed PCR binding
On Mo, 11.09.23 14:48, Aleksandar Kostadinov (akost...@redhat.com) wrote: > Hi again. I tried to boot from UKI to no avail. > > First created a "db" certificate > > openssl req -newkey rsa:2048 -nodes -keyout db_arch.key -new -x509 -sha256 > > -days 3650 -subj "/CN=My DB cert/" -out db.pem > > openssl x509 -outform DER -in db.pem -out db.crt > > Then uploaded it to secure boot trust VIA USB drive and the UEFI seup. > > Then created UKI: > > /usr/lib/systemd/ukify \ > > /lib/modules/6.4.12-200.fc38.x86_64/vmlinuz \ > > /boot/initramfs-6.4.12-200.fc38.x86_64.img \ > > --pcr-private-key=/etc/systemd/tpm2-pcr-private-key.pem \ > > --pcr-public-key=/etc/systemd/tpm2-pcr-public-key.pem \ > > --phases='enter-initrd' \ > > --pcr-banks=sha1,sha256 \ > > --secureboot-private-key=/etc/secure_boot/db.key \ > > --secureboot-certificate=/etc/secure_boot/db.pem \ > > --sign-kernel \ > > --cmdline='ro rhgb' > > Then added a boot entry: > > efibootmgr -c -d /dev/sda -p 1 -l /EFI/FEDORA/UKI/VMLINUZ612.EFI -L "Fedora > > UKI" > > Unfortunately when trying to boot this I get: > > Bad kernel image: Load Error That suggests the kernel you picked does not carry a correct PE/MZ signature, i.e. we generate that error typically if we can't jump into it because it doesn't come with the right PE headers. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd-repart /etc automount via discoverable partition specification
On Mo, 11.09.23 11:39, Nils Kattenbeck (nilskem...@gmail.com) wrote: > On Mon, Sep 11, 2023, 10:54 Lennart Poettering > wrote: > > > On So, 10.09.23 00:33, Nils Kattenbeck (nilskem...@gmail.com) wrote: > > > > > Hello, I am currently trying to build a linux image with discoverable > > > partitions in an A/B+etc+var scheme. > > > > The discoverable partition scheme has no concept of /etc/ discovery. It > > focusses on three basic setups: > > > > 1. writable root fs that contains /etc/, /var/ and /usr/ directly. > > 2. writable root fs that contains /etc/ and /var/ and gets an > >immutable /usr/ mounted in > > 3. immutable root fs that contains /etc/ and /usr/ directly and gets a > >writable /var/ mounted in. (the latter possibly as tmpfs, for truly > >stateless systems) > > There is also 4. with a writeable root which only contains /etc, an > immutable /usr and a temporary /var. Though I guess that can be covered > with the existing DPS...? That's pretty much the same as 2, except that /var is overmounted with a tmpfs. i.e. you would simply place /etc/fstab in there, that says /var is tmpfs. > > It was our assumption that these three cases should cover most > > intended behaviours nicely, i.e. systems with modifiable config, code > > and state. systems with modifiable config and state, but immutable > > code. And finally systems with immutable config and code, but > > modifiable state. > > > > A system where /etc/ was separate from the root fs is not covered by > > the above, because it is not clear what that would get us. if you want > > it immutable, why not stick it on an immutable root fs. And if you > > want it writable, why not stick it on a writable root fs directly? > > My use case is basically 2, /etc has to be writeable to persist the > machine-id across reboots, /var also has to be writeable and /usr can be > immutable. > > The problem I am then likely facing is that I create the partitions wrong. 
> I am using mkosi and tried several different repart.d configuration with > type=root+type=usr, type=root+type=var+type=use, and different CopyFiles= > and Exclude(Target)Files= but none of them seemed to have worked. If your /var/ is supposed to be a tmpfs, then don't mention it to mkosi/repart, just put an /etc/fstab into place that dictates /var is mounted as tmpfs. Other than that you should just be able to use Type=root and Type=usr then. > Are there special requirements for what the respective partitions must or > shall not contain when using several auto-discovered partitions? Or should > I ask on the mkosi issue tracker? If you have just root + usr then this should be a pretty common situation for mkosi, it's not special and should just work. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd-repart /etc automount via discoverable partition specification
On So, 10.09.23 00:33, Nils Kattenbeck (nilskem...@gmail.com) wrote: > Hello, I am currently trying to build a linux image with discoverable > partitions in an A/B+etc+var scheme. The discoverable partition scheme has no concept of /etc/ discovery. It focusses on three basic setups: 1. writable root fs that contains /etc/, /var/ and /usr/ directly. 2. writable root fs that contains /etc/ and /var/ and gets an immutable /usr/ mounted in 3. immutable root fs that contains /etc/ and /usr/ directly and gets a writable /var/ mounted in. (the latter possibly as tmpfs, for truly stateless systems) It was our assumption that these three cases should cover most intended behaviours nicely, i.e. systems with modifiable config, code and state; systems with modifiable config and state, but immutable code; and finally systems with immutable config and code, but modifiable state. A system where /etc/ was separate from the root fs is not covered by the above, because it is not clear what that would get us. If you want it immutable, why not stick it on an immutable root fs? And if you want it writable, why not stick it on a writable root fs directly? The design of saying "/etc/ is always part of the rootfs" also reflects the fact that /etc/fstab is the map of secondary file systems to mount, i.e. it generally contains references to other file systems that take precedence over the discoverable partition spec, and hence it is crucial that we place it on the first item in the chain so that we can take it into account before looking for other items in the chain. > I know that /usr and /var have a > corresponding partition UUID for automatically mounting them as per > DPS. However, I am not sure how to mount the /etc partition? Do I have > to specify it as the root partition and exclude /usr and /var in it? > Any help would be appreciated. If you want /etc/ split off, then the discoverable partition spec won't help you: you have to mount it explicitly from your initrd. 
Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Documentation question about sd-device
On So, 10.09.23 11:45, CARLOS VILLANUEVA Y SIMON (carvi...@gmail.com) wrote: > Hello all, > > Recently, while updating a code that was using libudev to a more modern > API, sd-device, I could not find some of the functions that are defined at > sd-device.h ( > https://github.com/systemd/systemd/blob/main/src/systemd/sd-device.h) at > the man pages ( > https://www.freedesktop.org/software/systemd/man/sd-device.html), e.g. > functions related with sd_device_monitor among others. > > May I be right or am I missing something? Yeah, the documentation is not entirely complete yet. Sorry! It's such a thankless job! But it's definitely on our TODO list. If you can't guess how things work from the header, let us know, we can provide you here with the necessary info to get things off the ground. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Fedora 38 and signed PCR binding
On Sa, 02.09.23 22:22, Aleksandar Kostadinov (akost...@redhat.com) wrote: > Looking at the PR [1] it looks like I need to do a lot of things at > each update manually. Is the thing in the comment the only thing I > need to do or are there other things as well? There's nowadays "ukify" that does all of this for you in one relatively easy step, it's our recommended approach to building UKIs these days. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Fedora 38 and signed PCR binding
On Sa, 02.09.23 22:18, Aleksandar Kostadinov (akost...@redhat.com) wrote: > Hello, > > Trying to configure Signed PCR binding on Fedora 38 by following > article [1] and adapting commands for signing. > > What I did was basically this: > > openssl genrsa -out /etc/systemd/tpm2-pcr-private-key.pem 2048 > > openssl rsa -in /etc/systemd/tpm2-pcr-private-key.pem -pubout -out > > /etc/systemd/tpm2-pcr-public-key.pem > > sudo systemd-cryptenroll --tpm2-device=auto > > --tpm2-public-key-pcrs=7+9+11+12+13+14+15 /dev/sda3 > > added tpm2-device=auto,tpm2-pcrs=7+9+11+12+13+14+15 > > But automatic unlocking does *not* work. And This is what > systemd-measure returns: > > $ /usr/lib/systemd/systemd-measure status > Warning: current kernel image does not support measuring itself, the > command line or initrd system extension images. > The PCR measurements seen are unlikely to be valid. > # PCR[11] Unified Kernel Image (NOT SET!) > 11:sha1= > 11:sha256= > # PCR[12] Kernel Parameters (NOT SET!) > 12:sha1= > 12:sha256= > # PCR[13] initrd System Extensions (NOT SET!) > 13:sha1= > 13:sha256= > > Did I do something wrong? Is just necessary integration missing from > Fedora 38 so I better revert to normal PCR binding? Is your kernel built with sd-stub glued in front of it? i.e. did you use ukify? Note that Fedora still uses a legacy boot path with grub and traditional kernels, instead of sd-boot/sd-stub and UKIs. PCR measurements are messy there, and the PCR signature stuff as implemented in systemd-measure doesn't work there. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Online backup API for systemd-journal?
On Mo, 04.09.23 16:35, Etienne Doms (etienne.d...@gmail.com) wrote: > Hi, > > I have some embedded systems in the wild, not connected to anything, > on which you can push a button "something went wrong, create a dump". > Then later I can fetch the said dump and inspect it. > > I'd like to include the whole journal, for the current boot, in a > binary format so that I can later do "journalctl --file > path/to/journal-dump.bin" from another machine. I understand that > internally everything is stored in /var/log/journal/, but > I guess that I cannot blindly tar/cp the .journal files, since this > would be racy. That should actually work fine. journald has no locking around journal files: the server that writes to the files and the client that reads them are not synchronized. The client is supposed to handle incomplete writes by simply suppressing display of the trailing, incomplete entries. This is a common code path that is quite well tested these days. Hence, it should actually be fine to just copy the journal files as they are being written. The tools on the other side may then see a file with records currently "in flight" that are referenced in some places but not others, but that should be totally OK: the tools should handle this, and it is no different from their local access. > So, is there an API to safely dump a big ".journal" file containing a > snapshot of "journalctl -b"? I could not find anything in the > documentation, sorry in advance if I missed something obvious. You can use "-o export" to dump the files in an "export" format. But this is just about returning the data in a different format, it does not give you any synchronization guarantees, since journalctl started that way will just read the data from the journal files unsynchronized like everyone else. Lennart -- Lennart Poettering, Berlin
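The "just copy the files" approach from the reply above could be scripted roughly as follows. This is only a minimal sketch, not an official tool: the function name snapshot_journal and both directory arguments are assumptions made for illustration, and the code relies solely on the statement above that readers tolerate trailing, incomplete entries.

```python
import shutil
from pathlib import Path


def snapshot_journal(src_dir, dst_dir):
    """Copy every *.journal file found directly in src_dir into dst_dir.

    No locking or synchronization with journald is attempted. Per the
    explanation above, the reading tools are expected to skip trailing,
    incomplete entries in the copies.
    """
    src = Path(src_dir)
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for journal in sorted(src.glob("*.journal")):
        target = dst / journal.name
        shutil.copy2(journal, target)  # plain file copy, metadata preserved
        copied.append(str(target))
    return copied
```

The source directory would typically be the per-machine directory below /var/log/journal/, and the resulting copies can then be inspected on another machine with "journalctl --file" or "journalctl --directory".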
Re: [systemd-devel] [multiseat] Attach virtual input to seat1
On So, 03.09.23 00:46, LuKaRo (li...@lrose.de) wrote: > > $ sudo loginctl attach seat1 /sys/devices/virtual/input/input43 > Could not attach device: No such device > $ sudo loginctl attach seat1 /sys/devices/virtual/input/input44 > Could not attach device: No such device > $ sudo loginctl attach seat1 /sys/devices/virtual/input/input23 > Could not attach device: No such device > $ sudo loginctl attach seat1 /sys/devices/virtual/input/input22 > Could not attach device: No such device > $ sudo loginctl attach seat1 /sys/devices/virtual/input/input21 > Could not attach device: No such device > > Any idea why all of them fail, and what could be a possible > workaround? See my reply here: https://lists.freedesktop.org/archives/systemd-devel/2023-September/049470.html The key is that the udev property ID_FOR_SEAT is not set for these devices. (We should definitely generate a more useful error in that case.) Only devices that have that property set can be assigned to seats. ID_FOR_SEAT is supposed to carry some form of stable ID string we can identify the device with, that remains the same between reboots. We currently set ID_FOR_SEAT to useful values for PCI and USB devices, but not on other busses. In particular virtual devices are not covered. "input23" is not useful as an identifier string in ID_FOR_SEAT, because these names are assigned in the order of probing, which typically is not stable. It should suffice to set the udev property via some udev rule to something reasonable, for the devices you add... I have no idea what that looks like for your specific type of devices. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [multiseat] How to make automatic ACL creation via udev "uaccess" tag work for seats other than seat0?
On Fr, 01.09.23 13:13, Christian Pernegger (perneg...@gmail.com) wrote: > Of course, if you want to take the position that it's a bit weird for > GNOME to use /dev/rfkill to detect the presence of BT devices, I can't > argue against that. :) Doesn't NM/bluez manage these things from privileged code anyway? Is this really done from inside the GNOME UI with direct device access? > (From a use case perspective, it would be nice if paired BT devices > could somehow be tagged. I.e. so that each seat can pair devices and > manage them, but not see or manage ones paired by other seats and/or > users.) Yeah, it would be great if bluez would gain native multi-seat support, i.e. that it tracks seat assignments for paired devices. But that's something to request from bluez upstream, not systemd. > > You cannot attach devices to multiple seats. > > Roger that. Is there a way to exempt devices from the multiseat > mechanism, though? Mark them not seat-specific? Or is that > hard-coded? change the udev rules to not set the "seat" udev property on the relevant device. That's what decides whether seat mgmt is done for the device. > > You should be able to assign the device to a different seat though. > > systemctl attach won't let me, at least not using the path seat-status > spits out. But I'm sure the version of systemd in Ubuntu 22.04 is > ancient, and/or they may have done something to it. If you like, I can > try whether adding a udev rule manually works, but personally I'm not > too bothered about this particular issue. So the problem is that the rfkill device does not carry the ID_FOR_SEAT property right now, we only add that for pci/usb/… devices, i.e. the usual busses. rfkill being a virtual device doesn't carry that property. That property carries the string identifier that we should use for identifying the device for seat mgmt purposes. It's usually derived from the path ID of the device. 
Making rfkill manageable via "loginctl attach" would mean adding such a property there. Happy to take a patch for that, please submit via github. > > that's how things work and people assume them to work: graphics render > > services are used to bring stuff to screen. > > I don't know about this. Yes, seat1 could hog the GPU that seat0's > outputs are attached to, or vice versa, but seat1 could just as well > hog all the RAM or saturate the CPU. My point being, seats share the > host's CPU power, RAM, ..., already, why not the rendering/compute > power as well. IMVHO it's really just inputs and outputs that should > be seat-specific. Restricting the shared resources available to a > given seat, allocating them fairly, etc., is a different problem (and > arguably one that I'd tackle per user and not per seat). CPU/RAM are by default resource managed, i.e. each user logged in gets a similar amount under pressure, as controlled via the cgroups logic. This is different from GPU resources, there's no such resource management for that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] [multiseat] How to make automatic ACL creation via udev "uaccess" tag work for seats other than seat0?
On Fr, 01.09.23 02:02, Christian Pernegger (perneg...@gmail.com) wrote: > Am Do., 31. Aug. 2023 um 21:55 Uhr schrieb Andrei Borzenkov > : > > > > On 31.08.2023 19:22, Christian Pernegger wrote: > > There is no ID_SEAT, so this device [/dev/rfkill] ]belongs to seat0 by > > default. > > It makes no sense for /dev/rfkill to belong to a specific seat, > though. Typically any RF kill buttons are attached to the main seat of a laptop only, hence this assignment. > GNOME at least assumes the user to have write access. > Note that while /sys/devices/virtual/misc/rfkill shows up in the > output of loginctl seat-status it cannot be attached to another seat > ("Could not attach device: No such device"). You cannot attach devices to multiple seats. You should be able to assign the device to a different seat though. > Or what about /dev/kvm? Why should only seat0 have the ability to use > KVM? (It can't be attached to other seats, either.) /dev/kvm is 0666 by default, except on some distros that depart from that. Please contact them for help on how they intend to manage access there, but the uaccess logic is not it. > The dri/renderD??? device is automatically attached to the seat that > the dri/card? one is attached to (even though it isn't a child > according to the seat-status tree--funnily enough this does not happen > for the fb? device). fb is obsolete. fb devices are still assigned to seats but no unprivileged access is granted. > It makes sense that the rendering bits of a card should "belong to" > the seat that has the outputs, the problem is that this renders it > inaccessible to the other seats, which it shouldn't. A seat can > access another seat's *rendering capabilities* just fine as long as > the permissions are set correctly. Well, you can do lots of things. We ship defaults only. Feel free to write udev rules that assign things to whatever you want them to be assigned. 
By default render devices are only accessible to local users on the seat they are logged on to, not everyone else, since typically resources on a graphics card are bounded, and it makes sense to give access to users who also get access to the screen, because typically that's how things work and people assume them to work: graphics render services are used to bring stuff to screen. There's also a "render" group set up to which users can be added which should always get access. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Custom Localed Configuration Location
On Di, 29.08.23 17:18, TJ Shipp (onezoo...@msn.com) wrote: > I am trying to create a system where we can change locale on a > running system (where we would have daemons subscribe to dbus and > get the properties changed messages) but need to be able to change > the location of the locale file (by default in /etc/locale.conf) as > /etc is read-only on our system. We do not support that. /etc/ is the place for configuration on Linux, and if you make that immutable you basically turn off the ability to configure things at runtime. Which is totally OK to do of course, but if this is the mode you pick you shouldn't be surprised that this is what you get. > Is there a way to change the file location to a writeable location > as I can not find any current means to do such? This is not configurable, the path /etc/locale.conf is considered API. It's not a hidden backend or so, but a primary interface to this setting. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Additional Locale Variables for Units and Number Format
On Di, 29.08.23 17:17, TJ Shipp (onezoo...@msn.com) wrote: > I am trying to add in support for a separate variable to change our unit > system, and having both LANG and UNITS to identify the "locale" of the system. > We are also not only looking for English versus Metric, but are looking for > mixed units as well (both Imperial and Metric hybrid), as well as looking to > add number formats (1,000.00 vs 1.000,00) > > And what is the best way to add support for a new system environment variable > such as UNITS? > > P.S. If anyone is interested in contracting to do this work, please send me a > private message outside this list. systemd-devel is not the right forum for this. Not sure what a better forum for this is, but systemd is way too low-level system stuff for that. Hence, I don't know who to suggest you to contact about this, but maybe someone at the Linux Foundation can connect you. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Flushing DNS caches items on clock change.
On Mi, 30.08.23 18:06, Vishwanath Chandapur (vishwa...@gmail.com) wrote: > Hi, > > We are using systemd-resolved. We observed that on clock change, > systemd-resolved is flushing all caches. > > By looking into the code we found that this is implemented primarily for > DNSSEC. > > Is there any specific reason for flushing the other cache items like mDNS, > LLMNR? Usually things like clock jumps happen on system suspend/resume cycles, VM migration and other VM management non-linearities. Quite often this coincides with network connectivity changes, and hence we should invalidate whatever information we collected so far about the network. Given this is redundant info we can reacquire this should not be an issue. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Assertion '!ether_addr_is_null(addr)'
On Mi, 30.08.23 15:22, Mirza Krak (mirza.k...@gmail.com) wrote: > Hi, > > Environment: > * systemd: 250.5 This release is from 2021, i.e. relatively old. The issue you are describing is almost certainly already addressed in newer versions. Consider using a newer version. Or contact your OS vendor, asking them to maybe backport the fix in question. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Is it possible to change the cgroup uid/gid for a systemd slice?
On Mi, 30.08.23 23:08, Julio Lajara (julio.laj...@protonmail.com) wrote: > Hi all, I have created a systemd slice to constrain CPU/mem > resources for a service unit. The service unit runs as root (its a > bash script) and it runs a subprocess using systemd-run that it also > runs under the same slice but a different unprivileged user. The > subprocess needs to read the cgroup memory data directly from the > sysfs tree but it cant because its owned by root. sysfs tree? You mean cgroupfs tree? But the memory attributes are world readable, so no need to chown. > Is there way I can change the permissions on it in the slice similar > to how cgcreate has the -a option to set the uid/gid for the cgroup? There's not. Chowning of cgroups is pretty much about the ability to change them or create subgroups in them, but we do not allow either to client programs for slices. Lennart -- Lennart Poettering, Berlin
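Since the reply above notes that the memory attributes are world readable, an unprivileged process could read them without any chown. The following is a minimal sketch under assumptions: it presumes the unified (cgroup v2) hierarchy mounted at /sys/fs/cgroup, and the helper names cgroup_dir and read_memory_current are made up for illustration.

```python
from pathlib import Path


def cgroup_dir(proc_cgroup_text, mount="/sys/fs/cgroup"):
    """Map the contents of /proc/<pid>/cgroup to a cgroupfs directory.

    On the unified hierarchy the relevant line has the form "0::/some/path".
    The mount point is an assumption; adjust it if cgroupfs lives elsewhere.
    """
    for line in proc_cgroup_text.splitlines():
        hierarchy, controllers, path = line.split(":", 2)
        if hierarchy == "0" and controllers == "":
            return Path(mount) / path.lstrip("/")
    raise ValueError("no cgroup v2 entry found")


def read_memory_current(cgroup):
    """Return the current memory usage in bytes of the given cgroup."""
    return int((Path(cgroup) / "memory.current").read_text())


# Possible use from inside the unprivileged subprocess:
#   own = cgroup_dir(Path("/proc/self/cgroup").read_text())
#   print(read_memory_current(own))
```

Reading only needs the file's read permission bits, so ownership of the cgroup by root does not matter here.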
Re: [systemd-devel] Why are the priorities of stdout and stderr the same
On Di, 29.08.23 11:56, Cecil Westerhof (cldwester...@gmail.com) wrote: > > I agree with that usecase, and we have discussed this many times > > before, but we couldn#t come up with a nice way to make everything > > work: proper ordering and distintion of stdout/stderr. > > I agree that the default behaviour is the right one. But why not give > people the possibility to override this behaviour? When they override it > themselves, they cannot complain that they lose ordering. We are generally conservative about providing mechanisms that are too glaringly broken, even if they are opt-in. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Why are the priorities of stdout and stderr the same
On Di, 29.08.23 11:20, Cecil Westerhof (cldwester...@gmail.com) wrote: > Also: everything has a timestamp, so there is in my opinion when you choose > to take them apart no big problem. For stream connections like those used for stdout/stderr, lines do not come with timestamps. We add them on the reception side, which is too late. > > > To get what is send to stderr I had to do: > > > journalctl -p 6 -u aptCacheUsage.service > > > > > > which gave beside a lot of other things the things send to stdout. > > > > > > Now I have two different statements I can do: > > > journalctl -p 3 -u aptCacheUsage.service > > > > > > But it would be nice if I did not need two different statements (and the > > > logic around that) for that. > > > > Still not getting what you are trying to say here. > > > > Often I am only interested in what is sent to stderr and do not want what > is sent to stdout. When both have the same log level I can not really > filter on messages sent to stderr. At the moment I want to see the messages > sent to stderr, I will also get the messages sent to stdout because they > have the same error level. I agree with that usecase, and we have discussed this many times before, but we couldn't come up with a nice way to make everything work: proper ordering and distinction of stdout/stderr. The closest I came was using two distinct SOCK_DGRAM sockets connect()ed to the same target socket (instead of the current approach of using SOCK_STREAM). This would give us two benefits: for each delivered datagram we would get a source socket address reported to us, and it would tell us which of the two source sockets it was, hence whether it was stdout or stderr. Moreover, we would get a kernel-supplied timestamp on each datagram if we want. This however has a fairly big problem too: if programs write too much data into their stdout/stderr at once they would get EMSGSIZE back, which programs generally don't expect (i.e.
if write()'s size is larger than datagram max size you get EMSGSIZE). Programs trying to write too much usually expect blocking behaviour... Thus this approach is not really an option. Lennart -- Lennart Poettering, Berlin
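The EMSGSIZE problem described above can be reproduced in isolation. The sketch below is an illustration only and has nothing to do with journald's actual code: it assumes Linux AF_UNIX semantics, the function name is hypothetical, and the send buffer is shrunk deliberately so that an 8 MiB write is certain to exceed the maximum datagram size.

```python
import errno
import socket


def oversized_datagram_errno(payload_size=8 * 1024 * 1024):
    """Send one oversized datagram over an AF_UNIX SOCK_DGRAM socket pair.

    Returns the errno of the failed send, or 0 if the send succeeded.
    On Linux the maximum datagram size is bounded by the send buffer, so a
    payload far larger than SO_SNDBUF is rejected instead of blocking.
    """
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        # Request a small send buffer to make the limit easy to exceed.
        sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
        try:
            sender.send(b"x" * payload_size)
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    code = oversized_datagram_errno()
    print(errno.errorcode.get(code, code))
```

A program writing the same amount to a SOCK_STREAM socket or a pipe would instead block (or get a short write) until the reader catches up, which is the behaviour programs normally expect from stdout/stderr.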
Re: [systemd-devel] Why are the priorities of stdout and stderr the same
On Sa, 26.08.23 06:14, Cecil Westerhof (cldwester...@gmail.com) wrote: Please keep mails like this on the mailing list. > > We should not "assume the worst", hence given that the stderr stream > > is typically used for all kinds of informational messages we should > > not always assume its an error, because quite often its just > > informational. > > > > You have a very good point. When tcl opens a process for reading, it is an > error when there is something to read on stderr, except when you overrule > it. But that you can overrule it proves your point. > > > Hence, we use LOG_INFO if we have no clue simply because that's the > > "best assumption". > > > > I agree, but I would suggest a very simple solution. > There is SyslogLevel which sets the syslog level for stdout and stderr. I > would suggest adding SyslogLevelStderr. SyslogLevel would still set it for > both except when there is also SyslogLevelStderr. When journal redirection of both stdout + stderr is enabled for systemd services we'll connect a single pipe to both fds, in order to guarantee ordering, i.e. ensure that if something is written to stdout, and then something to stderr, we'll definitely process it in this order too. This however means, that on the receiving side we cannot distinguish stdout/stderr anymore, it's all one stream. Hence we can only choose between: guarantee correct ordering OR ability to distinguish stdout/stderr. We opted for the former, as corrupted ordering between stdout/stderr is just too confusing for users. > > We generally recommend apps to use syslog() or sd-journal APIs to > > generate their log messages and specify the log level for each message > > explicitly, to avoid any doubts. Many programming language's logging > > frameworks natively have support for these. > > The script I use can be run from the command-line and from a service. 
> Because of that I have to use: > logMsg --simple "${message}" >&2 > and: > echo "<3>$(logMsg --simple "${message}")" >&2 > > doable but inconvenient. > > > Now when I want the things send to stderr I also get the things send to > > > stdout. > > > > I can't parse that. > > > > To get what is send to stderr I had to do: > journalctl -p 6 -u aptCacheUsage.service > > which gave beside a lot of other things the things send to stdout. > > Now I have two different statements I can do: > journalctl -p 3 -u aptCacheUsage.service > > But it would be nice if I did not need two different statements (and the > logic around that) for that. Still not getting what you are trying to say here. Lennart -- Lennart Poettering, Berlin
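The "<3>" trick quoted above works because journald, by default (SyslogLevelPrefix=), interprets a leading "<N>" on each stdout/stderr line as that line's syslog priority. A script that runs both interactively and as a service could hide the difference in a small helper. The following is a sketch under assumptions: the helper names are invented here, and checking for the $JOURNAL_STREAM environment variable is only a rough approximation of "my output goes to the journal".

```python
import os
import sys

LOG_ERR = 3   # syslog priority "err"
LOG_INFO = 6  # syslog priority "info"


def running_under_journal():
    """Rough heuristic: systemd sets $JOURNAL_STREAM for journal-connected services."""
    return "JOURNAL_STREAM" in os.environ


def log_line(priority, message, stream=None, prefixed=None):
    """Write message line by line, optionally with a "<priority>" prefix.

    Every line gets its own prefix, because the priority is parsed per line.
    """
    stream = sys.stderr if stream is None else stream
    prefixed = running_under_journal() if prefixed is None else prefixed
    for line in message.splitlines() or [""]:
        stream.write(f"<{priority}>{line}\n" if prefixed else f"{line}\n")
    stream.flush()
```

With this, "journalctl -p 3 -u aptCacheUsage.service" would show only the lines logged via log_line(LOG_ERR, ...), while a terminal run stays free of the prefixes.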
Re: [systemd-devel] Append to logfile with year-month
On Do, 24.08.23 09:48, Cecil Westerhof (cldwester...@gmail.com) wrote: > In a service file I can use: > StandardOutput=append:/var/log/root/aptCacheUsage.log > > but I want to use something like: > StandardOutput=append:/var/log/root/aptCacheUsage_$(date +%%Y-%%m).log > > Did does not work, because this puts it in: > /var/log/root/aptCacheUsage_$(date +%Y-%m).log > > Is there a way I can put it in: > /var/log/root/aptCacheUsage_2023-08.log > > while it would automatically next month go into: >/var/log/root/aptCacheUsage_2023-09.log > > I could of-course put it into: > /var/log/root/aptCacheUsage.log > > and at the beginning of the month move it if it exists with a timed > service, but I really would not like that kind of solution. We do not support this. systemd supports evaluating some specifiers, but time/date is not one of them, in particular as we resolve specifiers at parse time of the unit only, not afterwards. Or in other words: we'd resolve the specifiers early at boot, and that doesn't look like what you want. Also, for long-running services this wouldn't work anyway, as we can't rotate files like that, because we cannot externally close the current stdout of a process and replace it with a new file. Hence, what you are trying to do is not supported, and is unlikely to ever be supported for multiple reasons. Sorry! Lennart -- Lennart Poettering, Berlin
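Given that the unit file cannot compute the file name, one conceivable workaround (not something proposed in this thread) is for the program itself to pick the per-month file each time it writes. A minimal sketch, with hypothetical helper names and the directory and stem taken from the question above:

```python
from datetime import datetime
from pathlib import Path


def monthly_log_path(directory, stem, now=None):
    """Build e.g. <directory>/<stem>_2023-08.log for the given (or current) time."""
    now = datetime.now() if now is None else now
    return Path(directory) / f"{stem}_{now:%Y-%m}.log"


def append_line(directory, stem, line, now=None):
    """Append one line to the log file of the current month and return its path.

    The file is opened per write, so a long-running process switches to the
    new month's file automatically, without any external rotation.
    """
    path = monthly_log_path(directory, stem, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(line.rstrip("\n") + "\n")
    return path
```

For example, append_line("/var/log/root", "aptCacheUsage", "some message") would write to /var/log/root/aptCacheUsage_2023-08.log when called in August 2023. The trade-off is that the output bypasses StandardOutput= entirely.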
Re: [systemd-devel] Why are the priorities of stdout and stderr the same
On Do, 24.08.23 16:31, Cecil Westerhof (cldwester...@gmail.com) wrote: > Normally in a script when something is send to stdout it is seen as an > error has occurred. > But in systemd both get a priority of 6 (info). > Why does stderr not get a priority of 3 (err), or at least lower as > stdout? stderr is a bit of a misnomer, it's not just for errors, it's also for progress output, informational output and so on, basically everything that is not considered the primary output contents that one would want to propagate in pipelines. We should not "assume the worst", hence given that the stderr stream is typically used for all kinds of informational messages we should not always assume it's an error, because quite often it's just informational. Hence, we use LOG_INFO if we have no clue, simply because that's the "best assumption". We generally recommend apps to use syslog() or sd-journal APIs to generate their log messages and specify the log level for each message explicitly, to avoid any doubts. Many programming languages' logging frameworks natively have support for these. > Now when I want the things send to stderr I also get the things send to > stdout. I can't parse that. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] Error during SCC_DAEMON installation
On Do, 24.08.23 13:28, Maber, Paul (paul.ma...@cgi.com) wrote: > Classification: Confidential > > When installing the SAP Cloud Connector, I am getting the following errors. > The installation is being performed by the user root as instructed. > > :/opt/sap/scc # journalctl -xeu scc_daemon.service > Aug 24 13:41:35 scc_daemon[5574]: scc_Daemon start failed, see > logfile: /opt/sap/scc/scc_daemon.log systemd is just the messenger here. Please contact SAP for help on this SAP product, not the systemd project. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd-cryptenroll with TPM2
On Di, 22.08.23 22:35, Aleksandar Kostadinov (akost...@redhat.com) wrote: > On Tue, Aug 22, 2023 at 8:10 PM Lennart Poettering > wrote: > > On Di, 22.08.23 19:16, Aleksandar Kostadinov (akost...@redhat.com) wrote: > <...> > > > If attacker replaces volume with unencrypted one, and it boots without > > > messing up the sealing PCRs, then probably attacker can query the TPM > > > and obtain the encryption key. Despite the fact that this is not (yet) > > > implemented in cryptenroll. > > > > Sure, if you allow unencrypted systems to boot in your OS then all > > bets are off. You shouldn't do that of course. > > > > (in my model of mind, where automatic GPT image dissection is used the > > image dissection policies are how this should be locked down, see > > systemd.image-policy(7). You can confgure that via the kernel cmdline: > > in systemd.image_policy=. > > > > In systemd there's the "systemd-pcrfs@.service" and > > "systemd-pcrmachine.service" which will measure the identity of file > > systems and of /etc/machine-id into PCR 15. (systemd-cryptsetup also > > mesures a derivate of the volume key to PCR 15). PCR 15 is supposed to > > be an identifier of the OS instance. > > Wait. I was looking at this PCR. But wouldn't it be set only after the > volume has been unlocked? This means that before a volume is unlocked, > it cannot protect anything? Actually it may protect in case where > attacker replaced the volume with another encrypted volume. But not if > attacker replaced with a plain volume. As I said earlier: if you don't encrypt you lost anyway. This is not a scenario I care about in my view of the world. And frankly, it really doesn't make much sense to try to lock down boot but not actually encrypt the disk... > Or is it measured with the *encrypted* volume key which would actually > protect from volume replacement of any sort (I think) and would mostly > solve my concern? 
No, we measure the decrypted volume key (or actually, we measure the result of an HMAC of a fixed string, keyed by the volume key, since we don't want the key to show up in measurement logs in any useable way). > I mean if somehow the LVM structure including the encrypted key(s) are > measured somewhere, then such an attack should not be viable. LVM? what's LVM got to do with anything? > I guess I should test whether replacing the volume with non-encrypted > will work. If it works, then there might be an issue. If it does not > work, then sealing with PCR 15 might be what will get me going, > because replacing with an encrypted volume will definitely modify it > and block decrypting of the original key. In my view of the world you have an authenticated + measured UKI that unlocks the encrypted root fs, and simply refuses to boot if the root fs is not encrypted with a key it can acquire somehow. This should give you all the protection you need. Lennart -- Lennart Poettering, Berlin
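The idea of measuring an HMAC keyed by the volume key, rather than the key itself, can be illustrated as follows. This is a sketch of the general technique only: the fixed string and the choice of SHA-256 are placeholders picked for this example, not the values systemd-cryptsetup actually uses, and the function name is hypothetical.

```python
import hashlib
import hmac


def key_derived_measurement(volume_key, label=b"example-fixed-string"):
    """Return HMAC(volume_key, label) as a value that is safe to log.

    The result is bound to the volume key (a different key yields a
    different value), but the key cannot practically be recovered from it,
    so it can appear in a measurement log without exposing the key.
    """
    return hmac.new(volume_key, label, hashlib.sha256).digest()
```

A replaced volume with a different key would therefore produce a different value, and thus a different PCR 15 after measurement, while the measurement log never contains the key in usable form.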
Re: [systemd-devel] systemd-cryptenroll with TPM2
On Di, 22.08.23 19:16, Aleksandar Kostadinov (akost...@redhat.com) wrote: > > > I'm concerned though about an attacker replacing the encrypted root volume > > > with a non-encrypted one. Which may result in system booting an attacker > > > controlled environment while PCRs may be in a state that allows decryption > > > of the original root volume. > > > > > > Would anything prevent the system from booting with a replaced root > > > volume? > > > > Well, when you bind your disk to the TPM then this means you place a > > TPM-encrypted key in the LUKS header. This key has to be passed to the > > right TPM to be unlocked. This means that if an attacker just has the > > disk it's hard for them to acquire the decrypted key if it lacks the > > TPM. But it also means that if an attacker wants to replace the disk > > its very hard to forge key that is locked against that specific TPM. > > If attacker replaces volume with unencrypted one, and it boots without > messing up the sealing PCRs, then probably attacker can query the TPM > and obtain the encryption key. Despite the fact that this is not (yet) > implemented in cryptenroll. Sure, if you allow unencrypted systems to boot in your OS then all bets are off. You shouldn't do that of course. (In my mental model, where automatic GPT image dissection is used, the image dissection policies are how this should be locked down, see systemd.image-policy(7). You can configure that via the kernel cmdline option systemd.image_policy=.) In systemd there's the "systemd-pcrfs@.service" and "systemd-pcrmachine.service" which will measure the identity of file systems and of /etc/machine-id into PCR 15. (systemd-cryptsetup also measures a derivative of the volume key into PCR 15.) PCR 15 is supposed to be an identifier of the OS instance. > > It analyzes the UEFI TPM event log (which lists all measurements made > > to PCRs), tries to recognize components in it safely. 
And then is > > supposed to use that to generate signed PCR policies from that, based > > on a keypair stored on the local TPM, that is itself protected by one > > of its own signed PCR policies. > > > > In the long run the way I envision this we'd have two signed PCR > > policies in place: > > > > 1. A vendor supplied one that covers the UKI and its resources (this > >already pretty much exists), i.e. PCR 11. This one is pre-computed > >at build time of the OS and hence can only cover resources known at > >that time. > > > > 2. A locally maintained one on the individual system, based on a local > >key, that covers everything inherently local that is hard to > >predict from the outside (and for good measure also covers the > >vendor supplied stuff, because why not). This would then cover PCRs > >0-7, 9, 11-13, 15, i.e. everything that is reasonably stable > >locally. > > > > Alas, as mentioned this is WIP, still. > > I didn't expect the unattended server TPM2 encryption to be such a > muddy ground. Probably because serious use cases also involve more > infrastructure and dedicated admins, etc. It is certainly my intention to make this all "just work" and "default on", even on consumer hw. Windows does it, so we should be able to do that as well. Lennart -- Lennart Poettering, Berlin
Re: [systemd-devel] systemd-cryptenroll with TPM2
On Mo, 21.08.23 19:56, Aleksandar Kostadinov (akost...@redhat.com) wrote: > Thanks, this is what I was also considering the feasibility of. And whether > it made sense to begin with. Any idea how can this be done with systemd? > > In man I read: > > > Note that currently when enrolling a new key of one of the five > > supported types listed above, it is required to first provide a > > passphrase, a recovery key or a FIDO2 token. It's currently not > > supported to unlock a device with a TPM2/PKCS#11 key in order to > enroll > > a new TPM2/PKCS#11 key. Thus, if in future key roll-over is desired > > So I wonder if systemd already does that, or is it just an artificial > limitation? Would be wonderful if it already did so. It's just that no one has implemented this. The unlocking code paths via cryptsetup and in cryptenroll are quite different, which doesn't make this trivial. But patches welcome. Generally, I am very much of the opinion that we shouldn't change the disk whenever PCRs change. Instead we should use signed PCR policies to accommodate "clean" PCR changes (as mentioned in the other mail in this thread), i.e. simply sign a new PCR policy if we learn about a new "golden" PCR state we want to permit. This is much more robust and scales better. Moreover, it makes it easy to invalidate old golden states, by implicitly binding things to an nvindex counter object in the TPM at the same time. Such rollback protection is, I am sure, crucial to guarantee the security of non-interactive systems. > P.S. Also another thing I was considering was that if I did this > "extension", then I'm not sure how to then properly setup the sealing. But > maybe with the signed PCRs support it can work as PCRs don't need to be in > the expected state at configuration time. But also I want to do with as > little modifications from defaults as possible. 
If I have to rewrite the > whole thing, it will be hard but also I don't want to risk making mistakes > that original scripts already avoid. Neither for the literal PCR policies nor for the signed PCR policies do the PCRs actually need to be in the expected state when enrolling. Support for the former was recently added upstream. Lennart -- Lennart Poettering, Berlin