Re: [systemd-devel] Starting a service before any networking

2023-09-29 Thread Jetchko Jekov
Actually, I believe the dhcpcd service is the one at fault here:

Looking at the [Unit] section of dhcpcd.service (in F39 at least) I see:

[Unit]
Description=A minimalistic network configuration daemon with DHCPv4,
rdisc and DHCPv6 support
Wants=network.target
Before=network.target

So it orders itself *before* network.target, but that alone is not enough.
It must also order itself After=network-pre.target.

From the docs:
network-pre.target: This passive target unit may be pulled in by
services that want to run before any network is set up, for example
for the purpose of setting up a firewall. All network management
software orders itself after this target, but does not pull it in.

And dhcpcd is network management software.
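
A drop-in could add the missing ordering; a minimal sketch (path
illustrative):

    # /etc/systemd/system/dhcpcd.service.d/network-pre.conf
    [Unit]
    After=network-pre.target

(Deliberately no Wants=network-pre.target: per the docs above, network
management software orders itself after the target but does not pull
it in.)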

On Thu, Sep 28, 2023 at 4:46 PM Mantas Mikulėnas  wrote:
>
> On Wed, Sep 27, 2023 at 12:31 PM Mark Rogers wrote:
>>
>> On Wed, 27 Sept 2023 at 10:18, Mantas Mikulėnas  wrote:
>>>
>>> So now I'm curious: if the first command you run is to bring the interface 
>>> *down*, then what exactly brought it up?
>>
>>
>> Good question. The down/up sequence had been working as a way to reset the
>> connection after boot, so I just transferred it to the ExecStartPre.
>>
>> Looking at the "journalctl -u dhcpcd" output, this is what I see from my 
>> last boot:
>> Feb 14 10:12:05 pi systemd[1]: Starting dhcpcd on all interfaces...
>> Feb 14 10:12:05 pi ip[372]: 2: eth0:  mtu 1500 qdisc 
>> noop state DOWN group default qlen 1000
>> Feb 14 10:12:05 pi ip[372]: link/ether b8:27:eb:0d:ee:bb brd 
>> ff:ff:ff:ff:ff:ff
>> Feb 14 10:12:05 pi ip[383]: 2: eth0:  mtu 
>> 1500 qdisc pfifo_fast state DOWN group default qlen 1000
>> Feb 14 10:12:05 pi ip[383]: link/ether b8:27:eb:0d:ee:bb brd 
>> ff:ff:ff:ff:ff:ff
>> Feb 14 10:12:06 pi dhcpcd[385]: wlan0: starting wpa_supplicant
>> Feb 14 10:12:36 pi dhcpcd[385]: timed out
>> Feb 14 10:12:36 pi systemd[1]: Started dhcpcd on all interfaces.
>> Feb 14 10:12:37 pi systemd[1]: Stopping dhcpcd on all interfaces...
>> Feb 14 10:12:37 pi dhcpcd[519]: sending signal TERM to pid 466
>> Feb 14 10:12:37 pi dhcpcd[519]: waiting for pid 466 to exit
>> Feb 14 10:12:38 pi systemd[1]: dhcpcd.service: Succeeded.
>> Feb 14 10:12:38 pi systemd[1]: Stopped dhcpcd on all interfaces.
>> Feb 14 10:12:38 pi systemd[1]: Starting dhcpcd on all interfaces...
>> Feb 14 10:12:38 pi ip[524]: 2: eth0:  mtu 
>> 1500 qdisc pfifo_fast state DOWN group default qlen 1000
>> Feb 14 10:12:38 pi ip[524]: link/ether b8:27:eb:0d:ee:bb brd 
>> ff:ff:ff:ff:ff:ff
>> Feb 14 10:12:38 pi ip[529]: 2: eth0:  mtu 
>> 1500 qdisc pfifo_fast state UP group default qlen 1000
>> Feb 14 10:12:38 pi ip[529]: link/ether b8:27:eb:0d:ee:bb brd 
>> ff:ff:ff:ff:ff:ff
>> Feb 14 10:12:38 pi dhcpcd[530]: wlan0: starting wpa_supplicant
>> Feb 14 10:12:49 pi dhcpcd[530]: Too few arguments.
>> Feb 14 10:12:49 pi dhcpcd[530]: Too few arguments.
>> Feb 14 10:12:49 pi systemd[1]: Started dhcpcd on all interfaces.
>>
>>  (I deleted the "ip addr" output from the interfaces other than eth0 for 
>> brevity.)
>>
>> The interesting thing is surely that dhcpcd is being started twice. Assuming 
>> that was always happening, that suggests dhcpcd was bringing the network up 
>> early (failing, but leaving it in a "stuck" state) and then again later 
>> (previously unable to recover from the first failure, but now able to)?
>
>
> That's possible... but again, I don't see how it would get into this "stuck" 
> state in any way other than driver and/or hardware issues, as the kernel 
> driver is where the power-up sequence is done... dhcpcd (like 'ip link set 
> eth0 up') pretty much just tells the OS to power the NIC on, then waits.
>
> (My previous laptop had a Realtek Ethernet NIC that often wouldn't recognize 
> Ethernet link after suspend/resume until I removed it from the PCI bus... 
> took several kernel releases until they fixed that.)
>
> --
> Mantas Mikulėnas


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lennart Poettering
On Fr, 29.09.23 10:53, Lewis Gaul (lewis.g...@gmail.com) wrote:

> Hi systemd team,
>
> I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
> container's namespaces (although not a child of the container's PID
> 1).

Yeah, joining into a container is really weird: it makes a process
appear from nowhere, possibly blocking resources, outside of the
resource and lifecycle control of the code in the container, outside of
any security restrictions, and so on.

I personally think joining a container via joining the namespaces
(i.e. podman exec) might be OK for debugging, but it's not a good
default workflow. Unfortunately the problems with the approach are not
well understood by the container people.

In systemd's own container logic (i.e. systemd-nspawn + machinectl) we
hence avoid doing anything like this. "machinectl shell" and related
commands will instead talk to PID 1 in the container and ask it to
spawn something off, rather than doing so themselves.
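
For example (container name illustrative):

    machinectl shell mycontainer /bin/bash

The shell is then spawned by the container's own PID 1, so it lands in
a proper cgroup under the container's lifecycle management rather than
appearing from nowhere.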

Kinda related to this: util-linux's "unshare" tool (which can be used
to generically enter a container like this) is also pretty broken in
this regard, btw; I asked them to fix that, but nothing has happened
there yet:

https://github.com/util-linux/util-linux/issues/2006

I'd advise "podman" and these things to never place joined processes
in the root cgroup of the container if they delegate cgroup access to
the container, because that really defeats the point. Instead they
should always join the cgroup of PID 1 in the container (which they
might already do I think), and if PID 1 is in the root cgroup, then
they should create their own subcgroup "/joined" or so, and put the
process in there, to not collide with the "no processes in inner
groups" rule of cgroupv2.

> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>
> More precisely, the problem manifests as follows (in
> https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676
> ):
> - Container exec processes are placed in the container's root cgroup by
> default, but if this fails (due to the "no internal processes" rule) then
> container PID 1's cgroup is used (see
> https://github.com/opencontainers/runc/issues/2356).

This is a really bad idea. At the very least the rule should be
reversed (which would still be racy, but certainly better). But as
mentioned they should never put something in the root cgroup if cgroup
delegation is on.

> - At systemd startup, systemd tries to create the init.scope cgroup and
> move all processes into it.
> - If a container exec process is created after finding procs to move and
> moving them but before enabling controllers then the exec process will be
> placed in the root cgroup.
> - When systemd then tries to enable controllers via subtree_control in the
> container's root cgroup, this fails because the exec process is in that
> cgroup.
>
> The root of the problem here is that moving processes out of a cgroup and
> enabling controllers (such that new processes cannot be created there) is
> not an atomic operation, meaning there's a window where a new process can
> get in the way. One possible solution/workaround in systemd would be to
> retry under this condition. Or perhaps this should be considered a bug in
> the container runtimes?

Yes, that's what I think. They should fix that.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lewis Gaul
> Wouldn't it be better to have the container inform the host via
> NOTIFY_SOCKET (the Type=notify mechanism)? I believe systemd has had
> support for sending readiness notifications from init to a container
> manager for quite a while.

> Use the notify socket and you'll get a notification back when the
> container is ready, without having to inject anything

To be clear, I'm not looking for alternative solutions for my specific
example; I was raising the general architectural issue.

On Fri, 29 Sept 2023 at 12:06, Luca Boccassi wrote:

> On Fri, 29 Sept 2023 at 12:00, Lewis Gaul  wrote:
> >
> > Hi systemd team,
> >
> > I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
> container's namespaces (although not a child of the container's PID 1).
> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>
> Use the notify socket and you'll get a notification back when the
> container is ready, without having to inject anything
>


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Luca Boccassi
On Fri, 29 Sept 2023 at 12:00, Lewis Gaul  wrote:
>
> Hi systemd team,
>
> I've encountered an issue when running systemd inside a container using 
> cgroups v2, where if a container exec process is created at the wrong moment 
> during early startup then systemd will fail to move all processes into a 
> child cgroup, and therefore fail to enable controllers due to the "no 
> internal processes" rule introduced in cgroups v2. In other words, a systemd 
> container is started and very soon after a process is created via e.g. 
> 'podman exec systemd-ctr cmd', where the exec process is placed in the 
> container's namespaces (although not a child of the container's PID 1). This 
> is not a totally crazy thing to be doing - this was hit when testing a 
> systemd container, using a container exec "probe" to check when the container 
> is ready.

Use the notify socket and you'll get a notification back when the
container is ready, without having to inject anything


Re: [systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Mantas Mikulėnas
On Fri, Sep 29, 2023, 12:54 Lewis Gaul  wrote:

> Hi systemd team,
>
> I've encountered an issue when running systemd inside a container using
> cgroups v2, where if a container exec process is created at the wrong
> moment during early startup then systemd will fail to move all processes
> into a child cgroup, and therefore fail to enable controllers due to the
> "no internal processes" rule introduced in cgroups v2. In other words, a
> systemd container is started and very soon after a process is created via
> e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
> container's namespaces (although not a child of the container's PID 1).
> This is not a totally crazy thing to be doing - this was hit when testing a
> systemd container, using a container exec "probe" to check when the
> container is ready.
>

Wouldn't it be better to have the container inform the host via
NOTIFY_SOCKET (the Type=notify mechanism)? I believe systemd has had
support for sending readiness notifications from init to a container
manager for quite a while.
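
A rough host-side sketch of that pattern, assuming podman's
--sdnotify=container mode (unit and image names are placeholders):

    [Service]
    Type=notify
    NotifyAccess=all
    ExecStart=/usr/bin/podman run --rm --sdnotify=container --name systemd-ctr my-systemd-image

--sdnotify=container passes the notify socket through to the container,
so the container's systemd reports READY=1 itself; NotifyAccess=all is
typically needed because the notification doesn't arrive from the
unit's main process.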

(Alternatively, connect out to the container's systemd or dbus Unix socket
and query it directly that way, but NOTIFY_SOCKET would avoid the need to
time it correctly.)

Other than that – I'm not a container expert but this does seem like a
self-inflicted problem to me. If you spawn processes unknown to systemd, it
makes sense that systemd will fail to handle them.

>


[systemd-devel] Systemd cgroup setup issue in containers

2023-09-29 Thread Lewis Gaul
Hi systemd team,

I've encountered an issue when running systemd inside a container using
cgroups v2, where if a container exec process is created at the wrong
moment during early startup then systemd will fail to move all processes
into a child cgroup, and therefore fail to enable controllers due to the
"no internal processes" rule introduced in cgroups v2. In other words, a
systemd container is started and very soon after a process is created via
e.g. 'podman exec systemd-ctr cmd', where the exec process is placed in the
container's namespaces (although not a child of the container's PID 1).
This is not a totally crazy thing to be doing - this was hit when testing a
systemd container, using a container exec "probe" to check when the
container is ready.

More precisely, the problem manifests as follows (in
https://github.com/systemd/systemd/blob/081c50ed3cc081278d15c03ea54487bd5bebc812/src/core/cgroup.c#L3676
):
- Container exec processes are placed in the container's root cgroup by
default, but if this fails (due to the "no internal processes" rule) then
container PID 1's cgroup is used (see
https://github.com/opencontainers/runc/issues/2356).
- At systemd startup, systemd tries to create the init.scope cgroup and
move all processes into it.
- If a container exec process is created after finding procs to move and
moving them but before enabling controllers then the exec process will be
placed in the root cgroup.
- When systemd then tries to enable controllers via subtree_control in the
container's root cgroup, this fails because the exec process is in that
cgroup.

The root of the problem here is that moving processes out of a cgroup and
enabling controllers (such that new processes cannot be created there) is
not an atomic operation, meaning there's a window where a new process can
get in the way. One possible solution/workaround in systemd would be to
retry under this condition. Or perhaps this should be considered a bug in
the container runtimes?
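
To make the window concrete, the startup sequence in terms of raw
cgroupfs operations is roughly (paths illustrative, inside the
container's cgroup namespace):

    # what systemd does during early startup, approximately:
    mkdir /sys/fs/cgroup/init.scope
    while read pid; do
        echo "$pid" > /sys/fs/cgroup/init.scope/cgroup.procs
    done < /sys/fs/cgroup/cgroup.procs
    # <-- window: an exec process created here lands in the root
    #     cgroup, which was just emptied
    echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
    # with a process back in the root cgroup, this write fails
    # ("no internal processes" rule)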

I have some tests exercising systemd containers at
https://github.com/LewisGaul/systemd-containers which are able to reproduce
this issue on a cgroups v2 host (in testcase
tests/test_exec_procs.py::test_exec_proc_spam):

(venv) root@ubuntu:~/systemd-containers# pytest --log-cli-level debug -k
exec_proc_spam --cgroupns private --setup-modes default --container-exe
podman
INFO tests.conftest:conftest.py:474 Running container image
localhost/ubuntu-systemd:20.04 with args: entrypoint=, command=['bash',
'-c', 'sleep 1 && exec /sbin/init'], cap_add=['sys_admin'], systemd=always,
tty=True, interactive=True, detach=True, remove=False, cgroupns=private,
name=systemd-tests-1695981045.12
DEBUG    tests.test_exec_procs:test_exec_procs.py:106 Got PID 1 cgroups:
0::/init.scope
DEBUG    tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 3 cgroups:
0::/init.scope
DEBUG    tests.test_exec_procs:test_exec_procs.py:111 Got exec proc 21 cgroups:
0::/
DEBUG    tests.test_exec_procs:test_exec_procs.py:114 Enabled controllers:
set()
=== short test summary info ===
FAILED tests/test_exec_procs.py::test_exec_proc_spam[private-unified-default] - AssertionError: assert set() >= {'memory', 'pids'}

Does anyone have any thoughts on this? Should this be considered a systemd
bug, or is it at least worth adding some explicit handling for this case?
Is there something container runtimes are doing wrong here from the
perspective of systemd?

Thanks,
Lewis