Re: [systemd-devel] Delaying VM startup until block devices are available

2024-01-26 Thread Andrei Borzenkov

On 27.01.2024 00:40, Orion Poplawski wrote:

On 1/26/24 01:21, Lennart Poettering wrote:

On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:


We have various VMs that are backed by LUKS-encrypted LVs.  At boot the volumes
are decrypted by clevis.  The problem we are seeing at the moment is that the
VMs are started before the block devices are decrypted.  Our current
solution is:


We generally wait for all devices listed in /etc/crypttab, unless you
set noauto or nofail.


We are setting 'nofail', because I don't think I want to fail the boot in
general.  They are not required for the system itself to function, just
certain VMs. e.g:

luks-backup /dev/vg_root/backup-raw none discard,_netdev,nofail

See below for more though.


# cat /etc/systemd/system/virtqemud.service.d/override.conf
[Unit]
After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target

Where we list each of the volumes to be decrypted as blocking the virtqemud
service.

Does anyone have any better alternatives?  My main issue is that it feels
somewhere in between fine-grained and coarse-grained control.

Ideally I think one would be able to have each individual VM startup
automatically delayed until the devices each used became available, but I
don't see how to do this.


I am not sure how libvirt works, but if it runs every VM in a systemd
unit, then you could just order the device before that unit, or the
unit after the device.

Really depends on how libvirt splits things up.


I'm honestly not sure how libvirt works here either.  But there seems to be 
this:

# rpm -qf /usr/lib/systemd/system/virtqemud.service
libvirt-daemon-driver-qemu-9.5.0-7.el9_3.alma.2.x86_64

which gets started:

Jan 25 14:42:58 systemd[1]: Starting Virtualization qemu daemon...
Jan 25 14:42:58 systemd[1]: Started Virtualization qemu daemon.

Then the qemu-kvm processes end up in their own scope:

● machine-qemu\x2d1\x2dsrv\x2dmry01.scope - Virtual Machine qemu-1-srv-mry01
  Loaded: loaded
(/run/systemd/transient/machine-qemu\x2d1\x2dsrv\x2dmry01.scope; transient)
   Transient: yes
  Active: active (running) since Thu 2024-01-25 14:42:58 PST; 22h ago
   Tasks: 6 (limit: 16384)
  Memory: 15.6G
 CPU: 1h 15min 44.863s
  CGroup: /machine.slice/machine-qemu\x2d1\x2dsrv\x2dmry01.scope
  └─libvirt
└─9086 /usr/libexec/qemu-kvm -name guest=...




Alternatively it seems like one should be able to delay all VM startup until
all volumes in /etc/crypttab were unlocked, rather than having to specify each
one.  But I don't see a target for that.


This is default behaviour. Anything listed in /etc/crypttab is ordered
before cryptsetup.target, which is ordered before sysinit.target,
which is ordered before basic.target, which is ordered before regular services.


We are specifying _netdev because they require the network to unlock.  This I
think puts them under remote-cryptsetup.target, and I used to depend on that.
But with EL9 I'm seeing:

# j -b -u remote-cryptsetup.target -u
'blockdev@dev-mapper-luks\x2dbackup.target' -u clevis-luks-askpass.service
--no-hostname

Jan 25 14:42:12 systemd[1]: Reached target Remote Encrypted Volumes.
Jan 25 14:42:12 systemd[1]: Started Forward Password Requests to Clevis.
Jan 25 14:42:48 clevis-luks-askpass[1706]: Unlocked /dev/vg_root/backup-raw
(UUID=d6d25a85-2d43-4780-a312-e0e9b2383807) successfully
Jan 25 14:42:54 systemd[1]: Reached target Block Device Preparation for
/dev/mapper/luks-backup.
Jan 25 14:42:59 systemd[1]: clevis-luks-askpass.service: Deactivated 
successfully.

# systemctl list-dependencies remote-cryptsetup.target
remote-cryptsetup.target
● ├─systemd-cryptsetup@luks\x2dbackup.service

# j --no-hostname -b -u 'systemd-cryptsetup@luks\x2dbackup.service'
Jan 25 14:42:12 systemd[1]: Starting Cryptography Setup for luks-backup...
Jan 25 14:42:42 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64,
key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:47 systemd-cryptsetup[1697]: Failed to activate with specified
passphrase. (Passphrase incorrect?)
Jan 25 14:42:48 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64,
key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:54 systemd[1]: Finished Cryptography Setup for luks-backup.

# systemctl show 'systemd-cryptsetup@luks\x2dbackup.service' | grep Type
Type=oneshot

So, if I'm following things correctly, this doesn't seem right.
remote-cryptsetup.target depends on systemd-cryptsetup@luks\x2dbackup.service.
This is a oneshot that is considered started after the main process exits,
and above is shown as 14:42:54.  But we are seeing 'Reached target Remote
Encrypted Volumes' at 14:42:12.

What am I missing?

systemd-252-18.el9.x86_64




"nofail" encrypted devices are not ordered before 
(remote-)cryptsetup.target to not delay startup. The reasoning is, if 
you do not care whether this device exists or not, there is no reason to 

Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services

2024-01-26 Thread Nils Kattenbeck
> > Interpreting arbitrary regexes configured by unpriv code in priv code
> > comes at some risk, because afair constructing them can come at O(2^n)
> > time, i.e. a rogue regex could make us consume unbounded time on
> > processing journal messages.
>
> Which regex engine is used?  glibc’s engine is not safe for use with
> untrusted input, but Rust’s is, so that might be an option in the
> future.  It isn’t OOM-safe, though.

Rust isn't used. To my knowledge libpcre2 is used (at least it was at
the time the feature landed).
That library does not seem to allow setting any restrictions which
would make it possible to parse untrusted input.
For how exactly this works for rust see the documentation of the crate:
https://docs.rs/regex/latest/regex/index.html#untrusted-input

So in theory it is certainly possible to allow a regex subset though I
am not aware of any C library which does this.
A simple workaround we have done in a project I work on is to restrict
the set of allowed characters.
Doing it that way, however, imposes more restrictions than are
theoretically necessary.

For the foreseeable future I agree with Lennart that documenting this
quirk should be the most important thing.
Afterwards this could be made configurable somehow, as well as showing
a message or exiting with a non-zero code to indicate that this is not
allowed.
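
The character-restriction workaround mentioned above could look roughly
like the following Python sketch. The allow-list, the length cap, and the
function name are assumptions made for illustration; they are not taken
from systemd or from the project referred to above.

```python
import string

# Hypothetical allow-list: literal characters plus '.', '^' and '$'.
# Quantifiers (* + ? {}), groups, alternation, character classes and
# backslash escapes are deliberately excluded, so an accepted pattern
# cannot contain nested repetition or backreferences.
ALLOWED_CHARS = frozenset(string.ascii_letters + string.digits + " _-.:/^$")


def is_pattern_allowed(pattern, max_length=128):
    """Return True if the pattern is non-empty, at most max_length
    characters long, and built only from allow-listed characters."""
    if not pattern or len(pattern) > max_length:
        return False
    return all(ch in ALLOWED_CHARS for ch in pattern)


if __name__ == "__main__":
    for candidate in ("^error: disk", "(a+)+$", "a" * 200):
        print(repr(candidate[:20]), is_pattern_allowed(candidate))
```

This rejects many harmless patterns as well, which is the trade-off
described above: the check is simple, but stricter than it would need to be.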


Re: [systemd-devel] Delaying VM startup until block devices are available

2024-01-26 Thread Orion Poplawski
On 1/26/24 01:21, Lennart Poettering wrote:
> On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:
> 
>> We have various VMs that are backed by LUKS-encrypted LVs.  At boot the volumes
>> are decrypted by clevis.  The problem we are seeing at the moment is that the
>> VMs are started before the block devices are decrypted.  Our current
>> solution is:
> 
> We generally wait for all devices listed in /etc/crypttab, unless you
> set noauto or nofail.

We are setting 'nofail', because I don't think I want to fail the boot in
general.  They are not required for the system itself to function, just
certain VMs. e.g:

luks-backup /dev/vg_root/backup-raw none discard,_netdev,nofail

See below for more though.

>> # cat /etc/systemd/system/virtqemud.service.d/override.conf
>> [Unit]
>> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>>
>> Where we list each of the volumes to be decrypted as blocking the virtqemud
>> service.
>>
>> Does anyone have any better alternatives?  My main issue is that it feels
>> somewhere in between fine-grained and coarse-grained control.
>>
>> Ideally I think one would be able to have each individual VM startup
>> automatically delayed until the devices each used became available, but I
>> don't see how to do this.
> 
> I am not sure how libvirt works, but if it runs every VM in a systemd
> unit, then you could just order the device before that unit, or the
> unit after the device.
> 
> Really depends on how libvirt splits things up.

I'm honestly not sure how libvirt works here either.  But there seems to be 
this:

# rpm -qf /usr/lib/systemd/system/virtqemud.service
libvirt-daemon-driver-qemu-9.5.0-7.el9_3.alma.2.x86_64

which gets started:

Jan 25 14:42:58 systemd[1]: Starting Virtualization qemu daemon...
Jan 25 14:42:58 systemd[1]: Started Virtualization qemu daemon.

Then the qemu-kvm processes end up in their own scope:

● machine-qemu\x2d1\x2dsrv\x2dmry01.scope - Virtual Machine qemu-1-srv-mry01
 Loaded: loaded
(/run/systemd/transient/machine-qemu\x2d1\x2dsrv\x2dmry01.scope; transient)
  Transient: yes
 Active: active (running) since Thu 2024-01-25 14:42:58 PST; 22h ago
  Tasks: 6 (limit: 16384)
 Memory: 15.6G
CPU: 1h 15min 44.863s
 CGroup: /machine.slice/machine-qemu\x2d1\x2dsrv\x2dmry01.scope
 └─libvirt
   └─9086 /usr/libexec/qemu-kvm -name guest=...

> 
>> Alternatively it seems like one should be able to delay all VM startup until
>> all volumes in /etc/crypttab were unlocked, rather than having to specify each
>> one.  But I don't see a target for that.
> 
> This is default behaviour. Anything listed in /etc/crypttab is ordered
> before cryptsetup.target, which is ordered before sysinit.target,
> which is ordered before basic.target, which is ordered before regular services.

We are specifying _netdev because they require the network to unlock.  This I
think puts them under remote-cryptsetup.target, and I used to depend on that.
But with EL9 I'm seeing:

# j -b -u remote-cryptsetup.target -u
'blockdev@dev-mapper-luks\x2dbackup.target' -u clevis-luks-askpass.service
--no-hostname

Jan 25 14:42:12 systemd[1]: Reached target Remote Encrypted Volumes.
Jan 25 14:42:12 systemd[1]: Started Forward Password Requests to Clevis.
Jan 25 14:42:48 clevis-luks-askpass[1706]: Unlocked /dev/vg_root/backup-raw
(UUID=d6d25a85-2d43-4780-a312-e0e9b2383807) successfully
Jan 25 14:42:54 systemd[1]: Reached target Block Device Preparation for
/dev/mapper/luks-backup.
Jan 25 14:42:59 systemd[1]: clevis-luks-askpass.service: Deactivated 
successfully.

# systemctl list-dependencies remote-cryptsetup.target
remote-cryptsetup.target
● ├─systemd-cryptsetup@luks\x2dbackup.service

# j --no-hostname -b -u 'systemd-cryptsetup@luks\x2dbackup.service'
Jan 25 14:42:12 systemd[1]: Starting Cryptography Setup for luks-backup...
Jan 25 14:42:42 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64,
key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:47 systemd-cryptsetup[1697]: Failed to activate with specified
passphrase. (Passphrase incorrect?)
Jan 25 14:42:48 systemd-cryptsetup[1697]: Set cipher aes, mode xts-plain64,
key size 512 bits for device /dev/vg_root/backup-raw.
Jan 25 14:42:54 systemd[1]: Finished Cryptography Setup for luks-backup.

# systemctl show 'systemd-cryptsetup@luks\x2dbackup.service' | grep Type
Type=oneshot

So, if I'm following things correctly, this doesn't seem right.
remote-cryptsetup.target depends on systemd-cryptsetup@luks\x2dbackup.service.
This is a oneshot that is considered started after the main process exits,
and above is shown as 14:42:54.  But we are seeing 'Reached target Remote
Encrypted Volumes' at 14:42:12.

What am I missing?

systemd-252-18.el9.x86_64
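
A rough Python sketch of how the per-volume override shown earlier in this
message could be generated from /etc/crypttab instead of being maintained
by hand. escape_path is a hypothetical helper, a simplified stand-in for
`systemd-escape --path` that does not handle "." or ".." path components,
and the second crypttab entry in the usage example is invented for
illustration.

```python
def escape_path(path):
    """Simplified unit-name escaping for an absolute path: drop leading,
    trailing and repeated slashes, turn '/' into '-', and hex-escape
    every byte outside [A-Za-z0-9:_.] (plus a leading '.')."""
    trimmed = "/".join(part for part in path.split("/") if part)
    if not trimmed:
        return "-"
    out = []
    for i, byte in enumerate(trimmed.encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")
        elif byte < 128 and (ch.isalnum() or ch in ":_" or (ch == "." and i > 0)):
            out.append(ch)
        else:
            out.append("\\x%02x" % byte)
    return "".join(out)


def crypttab_names(text):
    """Return the volume name (first field) of every non-comment line."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.append(line.split()[0])
    return names


def render_dropin(names):
    """Build drop-in text that orders a unit after each volume's
    blockdev@ target, all on a single After= line."""
    units = " ".join(
        "blockdev@%s.target" % escape_path("/dev/mapper/" + name)
        for name in names
    )
    return "[Unit]\nAfter=%s\n" % units


if __name__ == "__main__":
    # First entry is from this message; the second is a made-up example.
    sample = (
        "luks-backup /dev/vg_root/backup-raw none discard,_netdev,nofail\n"
        "luks-vm-01-disk0 /dev/vg_root/vm-01-disk0 none discard,_netdev,nofail\n"
    )
    print(render_dropin(crypttab_names(sample)), end="")
```

The output would be written to a drop-in such as the override.conf above.
Like the original override, it only adds ordering (After=) and does not
pull the volumes in.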


-- 
Orion Poplawski
he/him/his  - surely the least important thing about me
Manager of IT Systems  720-772-5637
NWRA, Boulder/CoRA Office 

Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services

2024-01-26 Thread Demi Marie Obenour
On Fri, Jan 26, 2024 at 09:11:24AM +0100, Lennart Poettering wrote:
> On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:
> 
> > Hi.
> >
> > I sent below mail some week ago, Barry's reply left me unsure as to
> > whether this would be a bug or not.  I still tend to assume that I'm
> > "doing something wrong".
> 
> This is currently not supported. The filters are communicated by the
> service manager to journald via xattrs on the cgroups, and journald
> will only consider those for cgroups owned by root, i.e. not on
> cgroups delegated to unpriv users, as is done for systemd --user
> instances.
> 
> Interpreting arbitrary regexes configured by unpriv code in priv code
> comes at some risk, because afair constructing them can come at O(2^n)
> time, i.e. a rogue regex could make us consume unbounded time on
> processing journal messages.

Which regex engine is used?  glibc’s engine is not safe for use with
untrusted input, but Rust’s is, so that might be an option in the
future.  It isn’t OOM-safe, though.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab




[systemd-devel] umount fails on system with huge (2TiB) buff/cache

2024-01-26 Thread Holger Kiehl
Hello,

with huge memory buff/cache a umount can take very long:

   time umount /mnt/u2

   real 10m43.026s
   user 0m0.000s
   sys  10m27.985s

which then causes a system not to umount the device properly when
rebooting, watching serial console:

   [  OK  ] Deactivated swap /dev/sda4.
   [  OK  ] Unmounted /boot.
   [  OK  ] Deactivated swap /dev/disk…0-c79c-4437-922e-21bb8c926cf8.
   [  OK  ] Stopped File System Check …9-80c6-4dc9-86db-f3a9ec179b3c.
   [244820.237512] audit: type=1131 audit(1706268318.490:23423): pid=1 uid=0 
auid=4294967295 ses=4294967295 
msg='unit=systemd-fsck@dev-disk-by\x2duuid-d079cce9\x2d80c6\x2d4dc9\x2d86db\x2df3a9ec179b3c
 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   [  OK  ] Stopped File System Check on /dev/mapper/datao1db.
   [244820.273498] audit: type=1131 audit(1706268318.526:23424): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-mapper-datao1db 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   [  OK  ] Unmounted /mnt/hdd1.
   [  OK  ] Stopped File System Check on /dev/mapper/datao1da.
   [244820.409479] audit: type=1131 audit(1706268318.662:23425): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-mapper-datao1da 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   [   ***] A stop job is running for /mnt/u2 (2min 55s / no limit)

Note it states 'no limit' and one can see after some minutes it says
it umounted /mnt/u2:

   [  OK  ] Unmounted /mnt/u2.
   [  OK  ] Reached target Unmount All Filesystems.
   [  OK  ] Stopped File System Check …f-5fca-4f3b-b51d-9bd61b3afed2.
   [245090.42] audit: type=1131 audit(1706268588.681:23426): pid=1 uid=0 
auid=4294967295 ses=4294967295 
msg='unit=systemd-fsck@dev-disk-by\x2duuid-37da47ff\x2d5fca\x2d4f3b\x2db51d\x2d9bd61b3afed2
 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   [  OK  ] Removed slice Slice /system/systemd-fsck.
   [  OK  ] Stopped target Preparation for Local File Systems.
            Stopping Device-Mapper Multipath Device Controller...
   [  OK  ] Stopped Remount Root and Kernel File Systems.
   [245090.487727] audit: type=1131 audit(1706268588.741:23427): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-remount-fs comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   [  OK  ] Stopped File System Check on Root Device.
   [245090.516713] audit: type=1131 audit(1706268588.770:23428): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-fsck-root comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   [  OK  ] Stopped Create Static Device Nodes in /dev.
   [245090.544717] audit: type=1131 audit(1706268588.798:23429): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-tmpfiles-setup-dev 
comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? 
res=success'
   [  OK  ] Stopped Device-Mapper Multipath Device Controller.
   [245090.575750] audit: type=1131 audit(1706268588.829:23430): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=multipathd comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   [  OK  ] Reached target System Shutdown.
   [  OK  ] Reached target Late Shutdown Services.
   [  OK  ] Finished System Reboot.
   [245090.619726] audit: type=1130 audit(1706268588.873:23431): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-reboot comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   [245090.639307] audit: type=1131 audit(1706268588.873:23432): pid=1 uid=0 
auid=4294967295 ses=4294967295 msg='unit=systemd-reboot comm="systemd" 
exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
   [  OK  ] Reached target System Reboot.
   [245090.672034] audit: type=1334 audit(1706268588.925:23433): prog-id=60 
op=UNLOAD
   [245090.679574] audit: type=1334 audit(1706268588.925:23434): prog-id=59 
op=UNLOAD
   [245090.687105] audit: type=1334 audit(1706268588.927:23435): prog-id=63 
op=UNLOAD
   [245120.718949] systemd-shutdown[1]: Syncing filesystems and block devices - 
timed out, issuing SIGKILL to PID 3377584.
   [245120.730995] systemd-journald[2863]: Failed to send WATCHDOG=1 
notification message: Connection refused
   [245120.774387] systemd-journald[2863]: Received SIGTERM from PID 1 
(systemd-shutdow).
   [245130.775151] systemd-shutdown[1]: Waiting for process: 3377515 (umount), 
3377584 ((sd-sync))
   [245210.836634] systemd-shutdown[1]: Sending SIGKILL to PID 3377515 (umount).
   [245210.857197] systemd-shutdown[1]: Sending SIGKILL to PID 3377584 
((sd-sync)).
   [245220.880927] systemd-shutdown[1]: Waiting for process: 3377515 (umount), 
3377584 ((sd-sync))
   [245280.135213] INFO: task (sd-sync):3377584 blocked for more than 122 
seconds.
   [245280.155688] 

Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services

2024-01-26 Thread Nils Kattenbeck
> Interpreting arbitrary regexes configured by unpriv code in priv code
> comes at some risk, because afair constructing them can come at O(2^n)
> time, i.e. a rogue regex could make us consume unbounded time on
> processing journal messages.
>
> Hence, I wouldn't hold your breath. Unless someone figures out a smart
> way to deal with this it's unlikely to be supported.

I am not sure about construction, but checking for matches with
arbitrary regexes can definitely result in a DoS.
Restricting the allowed features, however, alleviates this problem.
E.g. the Rust regex crate can check in O(m*n) with m = regex size and
n = input size.
It does this by not allowing (among other things) look-arounds or backreferences.
I am not sure how configurable pcre2pattern is but maybe the supported
features could be restricted for regexes from users.

Nils


Re: [systemd-devel] Delaying VM startup until block devices are available

2024-01-26 Thread Lennart Poettering
On Do, 25.01.24 16:28, Orion Poplawski (or...@nwra.com) wrote:

> We have various VMs that are backed by LUKS-encrypted LVs.  At boot the volumes
> are decrypted by clevis.  The problem we are seeing at the moment is that the
> VMs are started before the block devices are decrypted.  Our current
> solution is:

We generally wait for all devices listed in /etc/crypttab, unless you
set noauto or nofail.

>
> # cat /etc/systemd/system/virtqemud.service.d/override.conf
> [Unit]
> After=blockdev@dev-mapper-luks\x2dbackup.target blockdev@dev-mapper-luks\x2dvm\x2d01\x2ddisk0.target
>
> Where we list each of the volumes to be decrypted as blocking the virtqemud
> service.
>
> Does anyone have any better alternatives?  My main issue is that it feels
> somewhere in between fine-grained and coarse-grained control.
>
> Ideally I think one would be able to have each individual VM startup
> automatically delayed until the devices each used became available, but I
> don't see how to do this.

I am not sure how libvirt works, but if it runs every VM in a systemd
unit, then you could just order the device before that unit, or the
unit after the device.

Really depends on how libvirt splits things up.

> Alternatively it seems like one should be able to delay all VM startup until
> all volumes in /etc/crypttab were unlocked, rather than having to specify each
> one.  But I don't see a target for that.

This is default behaviour. Anything listed in /etc/crypttab is ordered
before cryptsetup.target, which is ordered before sysinit.target,
which is ordered before basic.target, which is ordered before regular services.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Bump: Testing LogFilterPatterns= on user-level services

2024-01-26 Thread Lennart Poettering
On Do, 25.01.24 22:29, Farblos (akfkqu.9df...@vodafonemail.de) wrote:

> Hi.
>
> I sent below mail some week ago, Barry's reply left me unsure as to
> whether this would be a bug or not.  I still tend to assume that I'm
> "doing something wrong".

This is currently not supported. The filters are communicated by the
service manager to journald via xattrs on the cgroups, and journald
will only consider those for cgroups owned by root, i.e. not on
cgroups delegated to unpriv users, as is done for systemd --user
instances.

Interpreting arbitrary regexes configured by unpriv code in priv code
comes at some risk, because afair constructing them can come at O(2^n)
time, i.e. a rogue regex could make us consume unbounded time on
processing journal messages.

Hence, I wouldn't hold your breath. Unless someone figures out a smart
way to deal with this it's unlikely to be supported.

We should document this, however, I guess. Hence if you file an issue
that would be more than welcome, so that we can keep track of this.

Lennart

--
Lennart Poettering, Berlin