Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-28 Thread Lennart Poettering
On Do, 28.01.21 10:08, Martin Wilck (mwi...@suse.com) wrote:

> Hi Lennart,
>
> thanks again.
>
> On Wed, 2021-01-27 at 23:56 +0100, Lennart Poettering wrote:
> > On Mi, 27.01.21 21:51, Martin Wilck (mwi...@suse.com) wrote:
> >
> > if you want the initrd environment to fully continue to exist,
>
> I don't. I just need /sys and /dev (and perhaps /proc and /run, too) to
> remain accessible. I believe most root storage daemons will need this.
>
> > consider creating a new mount namespace, bind mount the initrd root
> > into it recursively to some new dir you created. Then afterwards mark
> > that mount MS_PRIVATE. Then pivot_root()+chroot()+chdir() into your
> > new old world.
>
> And on exit, I'd need to tear all that down again, right? I don't want
> my daemon to block shutdown because some file systems haven't been
> cleanly unmounted.

if you don't need the initrd root, i.e. don't intend to open any
further files, then you can just mount an empty tmpfs on your
tempdir, mount /proc and /sys into it, then transition your process
into it and forget about the rest.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-28 Thread Martin Wilck
Hi Lennart,

thanks again.

On Wed, 2021-01-27 at 23:56 +0100, Lennart Poettering wrote:
> On Mi, 27.01.21 21:51, Martin Wilck (mwi...@suse.com) wrote:
> 
> if you want the initrd environment to fully continue to exist,

I don't. I just need /sys and /dev (and perhaps /proc and /run, too) to
remain accessible. I believe most root storage daemons will need this.

> consider creating a new mount namespace, bind mount the initrd root
> into it recursively to some new dir you created. Then afterwards mark
> that mount MS_PRIVATE. Then pivot_root()+chroot()+chdir() into your
> new old world.

And on exit, I'd need to tear all that down again, right? I don't want
my daemon to block shutdown because some file systems haven't been
cleanly unmounted.

Regards,
Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-27 Thread Martin Wilck
On Tue, 2021-01-26 at 11:33 +0100, Lennart Poettering wrote:
> 
> > 
> > [Unit]
> > Description=NVMe Event Monitor for Automatical Subsystem Connection
> > Documentation=man:nvme-monitor(1)
> > DefaultDependencies=false
> > Conflicts=shutdown.target
> > Requires=systemd-udevd-kernel.socket
> > After=systemd-udevd-kernel.socket
> 
> Why do you require this?
> 

Brain fart on my part. I need to connect to the kernel socket, but that
doesn't require the systemd unit.

> My guess: the socket unit gets shut down, and since you have Requires=
> on it you thus go away too.

That was it, thanks a lot. So obvious in hindsight :-/

Meanwhile I've looked a bit deeper into the problems accessing "/dev"
that I talked about in my other post. scandir on "/" actually returns
an empty directory after switching root, and any path lookups for
absolute paths fail. I didn't expect that, because I thought systemd
removed the contents of the old root, and stopped on (bind) mounts.
Again, this is systemd-234.

If I chdir("/run") before switching root and chroot("..") afterwards
(*), I'm able to access everything just fine (**). However, if I do
this, I end up in the real root file system, which is what I wanted to
avoid in the first place.

So, I guess I'll have to create bind mounts for /dev, /sys etc. in the
old root, possibly after entering a private mount namespace?

The other option would be to save fd's for the file systems I need to
access and use openat() only. Right?

Regards,
Martin

(*) Michal suggested simply doing chroot(".") instead. That might work
as well; I haven't tried it yet.

(**) For notification about switching root, I used epoll(EPOLLPRI) on
/proc/self/mountinfo, because I read that inotify doesn't work on proc.
Polling for EPOLLPRI works just fine.




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-27 Thread Lennart Poettering
On Mi, 27.01.21 21:51, Martin Wilck (mwi...@suse.com) wrote:

> Meanwhile I've looked a bit deeper into the problems accessing "/dev"
> that I talked about in my other post. scandir on "/" actually returns
> an empty directory after switching root, and any path lookups for
> absolute paths fail. I didn't expect that, because I thought systemd
> removed the contents of the old root, and stopped on (bind) mounts.
> Again, this is systemd-234.

Oh, right, we actually use MS_MOVE to move the old /dev to the new
root. If you stay behind in the old root you won't see anything
anymore — it got moved away.

Note that the switch root code also attempts to empty out the initrd
after the transition, or what's left of it. You might want to make the
initrd read-only if that is a problem for you.

> If I chdir("/run") before switching root and chroot("..") afterwards
> (*), I'm able to access everything just fine (**). However, if I do
> this, I end up in the real root file system, which is what I wanted to
> avoid in the first place.

Yes, this works the way it works, because /run is moved to the new
root, and thus if you chroot to its parent you are in the new root.

> So, I guess I'll have to create bind mounts for /dev, /sys etc. in the
> old root, possibly after entering a private mount namespace?

if you want the initrd environment to fully continue to exist,
consider creating a new mount namespace, bind mount the initrd root
into it recursively to some new dir you created. Then afterwards mark
that mount MS_PRIVATE. Then pivot_root()+chroot()+chdir() into your
new old world.

Also, make the initrd superblock read-only if you need its contents.
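
For illustration, a rough and untested sketch of that sequence (error
handling omitted; the directory name /run/initrd-root is made up; the
final transition is done with plain chroot() here, since pivot_root()
can be picky when the current root is the initramfs rootfs):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <unistd.h>

/* Keep a private copy of the initrd root visible only to this process. */
static void keep_initrd_root(void)
{
        unshare(CLONE_NEWNS);                                /* new mount namespace */
        /* not part of the recipe above, but keeps our mounts from leaking back to the host */
        mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL);
        mkdir("/run/initrd-root", 0755);
        mount("/", "/run/initrd-root", NULL, MS_BIND | MS_REC, NULL);
        mount(NULL, "/run/initrd-root", NULL, MS_PRIVATE | MS_REC, NULL);
        chdir("/run/initrd-root");
        chroot(".");                                         /* enter the "new old world" */
        chdir("/");
}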

> The other option would be to save fd's for the file systems I need to
> access and use openat() only. Right?

That works too, if you can.
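
E.g. (rough, untested sketch; the node name "nvme0" is just an example):

#include <fcntl.h>
#include <unistd.h>

static int dev_fd = -1;

/* Take the fd while the initrd /dev is still reachable by path. */
static void pin_dev(void)
{
        dev_fd = open("/dev", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
}

/* Still works after the old root has been emptied out, e.g. name = "nvme0". */
static int open_dev_node(const char *name)
{
        return openat(dev_fd, name, O_RDWR | O_CLOEXEC);
}
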
> (**) For notification about switching root, I used epoll(EPOLLPRI) on
> /proc/self/mountinfo, because I read that inotify doesn't work on proc.
> Polling for EPOLLPRI works just fine.

Right, sorry. POLLPRI is the right API. inotify is used by cgroupfs
for similar notifications, and I mixed that up. For
/proc/self/mountinfo POLLPRI is the right choice.
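
For reference, a rough and untested sketch of that notification loop
(plain poll() shown; error handling omitted):

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Block until the switch root has happened. */
static void wait_for_switch_root(void)
{
        char buf[4096];
        int fd = open("/proc/self/mountinfo", O_RDONLY | O_CLOEXEC);
        struct pollfd pfd = { .fd = fd, .events = POLLPRI };

        for (;;) {
                poll(&pfd, 1, -1);                  /* mount table changed */
                lseek(fd, 0, SEEK_SET);             /* re-read so the next poll() blocks again */
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;
                if (access("/etc/initrd-release", F_OK) < 0)
                        break;                      /* ENOENT: the initrd root is gone */
        }
        close(fd);
}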

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-26 Thread Martin Wilck
On Tue, 2021-01-26 at 11:30 +0100, Lennart Poettering wrote:
> 
> > Imagine two parallel instances of systemd-udevd (IMO there are
> > reasons
> > to handle it like a "root storage daemon" in some distant future).
> 
> Hmm, what? Nah... udev is about discovery; it should not be required to
> maintain access to something you found.

True. But if udev ran without interruption, we could get rid of
coldplug after switching root. That could possibly save us a lot of
trouble.

Anyway, it's just a thought I find tempting.

Regards,
Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-26 Thread Lennart Poettering
On Di, 26.01.21 13:30, Martin Wilck (mwi...@suse.com) wrote:

> On Tue, 2021-01-26 at 11:30 +0100, Lennart Poettering wrote:
> >
> > > Imagine two parallel instances of systemd-udevd (IMO there are
> > > reasons
> > > to handle it like a "root storage daemon" in some distant future).
> >
> > Hmm, what? Nah... udev is about discovery; it should not be required to
> > maintain access to something you found.
>
> True. But if udev ran without interruption, we could get rid of
> coldplug after switching root. That could possibly save us a lot of
> trouble.

And introduce new trouble. Usually the rules on the host are more
comprehensive than those in the initrd. You have to coldplug for the
bigger ruleset. If you want to avoid that, you would basically have to
pack a ton more stuff into the initrd.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-26 Thread Lennart Poettering
On Di, 26.01.21 01:19, Martin Wilck (mwi...@suse.com) wrote:

> On Mon, 2021-01-25 at 18:33 +0100, Lennart Poettering wrote:
> >
> > Consider using IgnoreOnIsolate=.
> >
>
> I can't get this to work. I installed this unit in the initrd (note the
> ExecStop "command"):



>
> [Unit]
> Description=NVMe Event Monitor for Automatical Subsystem Connection
> Documentation=man:nvme-monitor(1)
> DefaultDependencies=false
> Conflicts=shutdown.target
> Requires=systemd-udevd-kernel.socket
> After=systemd-udevd-kernel.socket

Why do you require this?

My guess: the socket unit gets shut down, and since you have Requires=
on it you thus go away too.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-26 Thread Lennart Poettering
On Mo, 25.01.21 19:04, Martin Wilck (mwi...@suse.com) wrote:

> Is there any way for the daemon to get notified if root is switched?

/proc/self/mountinfo sends out notification events via inotify when
mounts are established/removed. I am pretty sure pivot_root() also
generates that. Your daemon could subscribe to that, and then recheck
each time if /etc/initrd-release is still accessible. Once you see
ENOENT on that you can assume the switch root took place, then close
the inotify.

> Would there be a potential security issue because the daemon keeps a
> reference to the initrd root FS?

Modern initrds transition their own root to /run/initramfs anyway, so
this shouldn't be a problem normally.

> Imagine two parallel instances of systemd-udevd (IMO there are reasons
> to handle it like a "root storage daemon" in some distant future).

Hmm, what? Nah... udev is about discovery; it should not be required to
maintain access to something you found.

> > option two: if you cannot have multiple instances of your subsystem,
> > then the only option is to make the initrd version manage
> > everything. But of course, that sucks, but there's little one can do
> > about that.
>
> Why would it be so bad? I would actually prefer a single instance for
> most subsystems. But maybe I'm missing something.

Well, because you can't update things on the fly then: you cannot
reexec, since everything is backed by the initrd. You cannot restart
things, and so on.

Lennart

--
Lennart Poettering, Berlin


Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-25 Thread Martin Wilck
On Mon, 2021-01-25 at 18:33 +0100, Lennart Poettering wrote:
> 
> Consider using IgnoreOnIsolate=.
> 

I can't get this to work. I installed this unit in the initrd (note the
ExecStop "command"):

[Unit]
Description=NVMe Event Monitor for Automatical Subsystem Connection
Documentation=man:nvme-monitor(1)
DefaultDependencies=false
Conflicts=shutdown.target
Requires=systemd-udevd-kernel.socket
After=systemd-udevd-kernel.socket
Before=sysinit.target systemd-udev-trigger.service nvmefc-boot-connections.service
RequiresMountsFor=/sys
IgnoreOnIsolate=true

[Service]
Type=simple
ExecStart=/usr/sbin/nvme monitor $NVME_MONITOR_OPTIONS
ExecStop=-/usr/bin/systemctl show -p IgnoreOnIsolate %N
KillMode=mixed

[Install]
WantedBy=sysinit.target

I verified (in a pre-pivot shell) that systemd had seen the
IgnoreOnIsolate property. But when initrd-switch-root.target is
isolated, the unit is cleanly stopped nonetheless.

[  192.832127] dolin systemd[1]: initrd-switch-root.target: Trying to enqueue job initrd-switch-root.target/start/isolate
[  192.836697] dolin systemd[1]: nvme-monitor.service: Installed new job nvme-monitor.service/stop as 98
[  193.027182] dolin systemctl[3751]: IgnoreOnIsolate=yes
[  193.029124] dolin systemd[1]: nvme-monitor.service: Changed running -> stop-sigterm
[  193.029353] dolin nvme[768]: monitor_main_loop: monitor: exit signal received
[  193.029535] dolin systemd[1]: Stopping NVMe Event Monitor for Automatical Subsystem Connection...
[  193.065746] dolin systemd[1]: Child 768 (nvme) died (code=exited, status=0/SUCCESS)
[  193.065905] dolin systemd[1]: nvme-monitor.service: Child 768 belongs to nvme-monitor.service
[  193.066073] dolin systemd[1]: nvme-monitor.service: Main process exited, code=exited, status=0/SUCCESS
[  193.066241] dolin systemd[1]: nvme-monitor.service: Changed stop-sigterm -> dead
[  193.066403] dolin systemd[1]: nvme-monitor.service: Job nvme-monitor.service/stop finished, result=done
[  193.066571] dolin systemd[1]: Stopped NVMe Event Monitor for Automatical Subsystem Connection.
[  193.500010] dolin systemd[1]: initrd-switch-root.target: Job initrd-switch-root.target/start finished, result=done
[  193.500188] dolin systemd[1]: Reached target Switch Root.

After boot, the service actually remains running when isolating e.g.
"rescue.target". But when switching root, it doesn't work.

dolin:~/:[141]# systemctl show -p IgnoreOnIsolate nvme-monitor.service
IgnoreOnIsolate=yes

Tested only with systemd-234 so far. Any ideas what I'm getting wrong?

Martin




Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-25 Thread Martin Wilck
On Mon, 2021-01-25 at 18:33 +0100, Lennart Poettering wrote:
> On Sa, 23.01.21 02:44, Martin Wilck (mwi...@suse.com) wrote:
> 
> > Hi
> > 
> > I'm experimenting with systemd's root storage daemon concept
> > (https://systemd.io/ROOT_STORAGE_DAEMONS/).
> > 
> > I'm starting my daemon from a service unit in the initrd, and
> > I set argv[0][0] to '@', as suggested in the text.
> > 
> > So far so good, the daemon isn't killed. 
> > 
> > But a lot more is necessary to make this actually *work*. Here's a
> > list
> > of issues I found, and what ideas I've had so far how to deal with
> > them. I'd appreciate some guidance.
> > 
> > 1) Even if a daemon is exempted from being killed by killall(), the
> > unit it belongs to will be stopped when initrd-switch-root.target
> > is
> > isolated, and that will normally cause the daemon to be stopped,
> > too.
> > AFAICS, the only way to ensure the daemon is not killed is by
> > setting
> > "KillMode=none" in the unit file. Right? Any other mode would send
> > SIGKILL sooner or later even if my daemon was smart enough to
> > ignore
> > SIGTERM when running in the initrd.
> 
> Consider using IgnoreOnIsolate=.

Ah, thanks a lot. IIUC that would actually make systemd realize that
the unit continues to run after switching root, which is good.

Like I remarked for KillMode=none, IgnoreOnIsolate=true would be
suitable only for the "root storage daemon" instance, not for a
possible other instance serving data volumes only.
I suppose there's no way to make this directive conditional on being
run from the initrd, so I'd need two different unit files,
or use a drop-in in the initrd.
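
For example, a hypothetical drop-in shipped only in the initrd image
(e.g. added by a dracut module) might look like this:

# /etc/systemd/system/nvme-monitor.service.d/initrd.conf (initrd image only)
[Unit]
IgnoreOnIsolate=true
# other initrd-only settings (e.g. a [Service] KillMode= override) could go here as well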

Is there any way for the daemon to get notified if root is switched?

> 
> > 3) The daemon that has been started in the initrd's root file
> > system
> > is unable to access e.g. the /dev file system after switching
> > root. I haven't yet systematically analyzed which file systems are
> > available.   I suppose this must be handled by creating bind
> > mounts,
> > but I need guidance how to do this. Or would it be
> > possible/advisable for the daemon to also re-execute itself under
> > the real root, like systemd itself? I thought the root storage
> > daemon idea was developed to prevent exactly that.
> 
> Not sure why it wouldn't be able to access /dev after switching. We
> do
> not allocate any new instance of that, it's always the same devtmpfs
> instance.

I haven't dug deeper yet; I just saw "No such file or directory"
error messages trying to access device nodes that I knew existed, so I
concluded there were issues with /dev.

> Do not reexec onto the host fs, that's really not how this should be
> done.

Would there be a potential security issue because the daemon keeps a
reference to the initrd root FS?

> 
> > 4) Most daemons that might qualify as "root storage daemon" also
> > have
> > a "normal" mode, when the storage they serve is _not_ used as root
> > FS,
> > just for data storage. In that case, it's probably preferable to
> > run
> > them from inside the root FS rather than as root storage daemon.
> > That
> > has various advantages, e.g. the possibility to update the software
> > without rebooting. It's not clear to me yet how to handle the two
> > options (root and non-root) cleanly with unit files.
> 
> option one: have two unit files? i.e. two instances of the subsystem,
> one managing the root storage, and one the rest.

Hm, that looks clumsy to me. It could be done e.g. for multipath by
using separate configuration files and setting up appropriate
blacklists, but it would cause a lot of work to be done twice; e.g.,
uevents would be received by both daemons and acted upon
simultaneously. Generally, ruling out race conditions wouldn't be easy.

Imagine two parallel instances of systemd-udevd (IMO there are reasons
to handle it like a "root storage daemon" in some distant future).

> option two: if you cannot have multiple instances of your subsystem,
> then the only option is to make the initrd version manage
> everything. But of course, that sucks, but there's little one can do
> about that.

Why would it be so bad? I would actually prefer a single instance for
most subsystems. But maybe I'm missing something.

Thanks,
Martin



Re: [systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-25 Thread Lennart Poettering
On Sa, 23.01.21 02:44, Martin Wilck (mwi...@suse.com) wrote:

> Hi
>
> I'm experimenting with systemd's root storage daemon concept
> (https://systemd.io/ROOT_STORAGE_DAEMONS/).
>
> I'm starting my daemon from a service unit in the initrd, and
> I set argv[0][0] to '@', as suggested in the text.
>
> So far so good, the daemon isn't killed. 
>
> But a lot more is necessary to make this actually *work*. Here's a list
> of issues I found, and what ideas I've had so far how to deal with
> them. I'd appreciate some guidance.
>
> 1) Even if a daemon is exempted from being killed by killall(), the
> unit it belongs to will be stopped when initrd-switch-root.target is
> isolated, and that will normally cause the daemon to be stopped, too.
> AFAICS, the only way to ensure the daemon is not killed is by setting
> "KillMode=none" in the unit file. Right? Any other mode would send
> SIGKILL sooner or later even if my daemon was smart enough to ignore
> SIGTERM when running in the initrd.

Consider using IgnoreOnIsolate=.

> 3) The daemon that has been started in the initrd's root file system
> is unable to access e.g. the /dev file system after switching
> root. I haven't yet systematically analyzed which file systems are
> available.   I suppose this must be handled by creating bind mounts,
> but I need guidance how to do this. Or would it be
> possible/advisable for the daemon to also re-execute itself under
> the real root, like systemd itself? I thought the root storage
> daemon idea was developed to prevent exactly that.

Not sure why it wouldn't be able to access /dev after switching. We do
not allocate any new instance of that, it's always the same devtmpfs
instance.

Do not reexec onto the host fs, that's really not how this should be
done.

> 4) Most daemons that might qualify as "root storage daemon" also have
> a "normal" mode, when the storage they serve is _not_ used as root FS,
> just for data storage. In that case, it's probably preferable to run
> them from inside the root FS rather than as root storage daemon. That
> has various advantages, e.g. the possibility to update the software
> without rebooting. It's not clear to me yet how to handle the two
> options (root and non-root) cleanly with unit files.

option one: have two unit files? i.e. two instances of the subsystem,
one managing the root storage, and one the rest.

option two: if you cannot have multiple instances of your subsystem,
then the only option is to make the initrd version manage
everything. But of course, that sucks, but there's little one can do
about that.

Lennart

--
Lennart Poettering, Berlin


[systemd-devel] Questions about systemd's "root storage daemon" concept

2021-01-22 Thread Martin Wilck
Hi

I'm experimenting with systemd's root storage daemon concept
(https://systemd.io/ROOT_STORAGE_DAEMONS/).

I'm starting my daemon from a service unit in the initrd, and
I set argv[0][0] to '@', as suggested in the text.
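
(Untested sketch of what I mean, done very early in main():)

int main(int argc, char *argv[])
{
        /* Mark this process as a root storage daemon so systemd's initrd
         * killing spree spares it (see ROOT_STORAGE_DAEMONS). */
        if (argc > 0 && argv[0][0] != '\0')
                argv[0][0] = '@';

        /* ... daemon setup and main loop ... */
        return 0;
}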

So far so good, the daemon isn't killed. 

But a lot more is necessary to make this actually *work*. Here's a list
of issues I found, and what ideas I've had so far how to deal with
them. I'd appreciate some guidance.

1) Even if a daemon is exempted from being killed by killall(), the
unit it belongs to will be stopped when initrd-switch-root.target is
isolated, and that will normally cause the daemon to be stopped, too. 
AFAICS, the only way to ensure the daemon is not killed is by setting
"KillMode=none" in the unit file. Right? Any other mode would send
SIGKILL sooner or later even if my daemon was smart enough to ignore
SIGTERM when running in the initrd.

2) KillMode=none will make systemd consider the respective unit
stopped, even if the daemon is still running. That feels wrong. Are
there better options?

3) The daemon that has been started in the initrd's root file system is
unable to access e.g. the /dev file system after switching root. I
haven't yet systematically analyzed which file systems are available. 
I suppose this must be handled by creating bind mounts, but I need
guidance how to do this. Or would it be possible/advisable for the
daemon to also re-execute itself under the real root, like systemd
itself? I thought the root storage daemon idea was developed to prevent
exactly that.

4) Most daemons that might qualify as "root storage daemon" also have
a "normal" mode, when the storage they serve is _not_ used as root FS,
just for data storage. In that case, it's probably preferable to run
them from inside the root FS rather than as root storage daemon. That
has various advantages, e.g. the possibility to update the software
without rebooting. It's not clear to me yet how to handle the two
options (root and non-root) cleanly with unit files. 

 - if (for "root storage daemon" mode) I simply put the enabled unit
file in the initrd, systemd will start the daemon twice, at least if
it's a "simple" service. I considered working with conditions, such as 

   ConditionPathExists=!/run/my-daemon/my-pidfile

(where the pidfile would have been created by the initrd-based daemon)
but that would cause the unit in the root FS to fail, which is ugly.

 - I could (for root mode) add the enabled unit file to the initrd
and afterwards disable it in the root fs, thus avoiding two copies to
be started. But that would cause issues whenever the initrd must be
rebuilt. I suppose it could be handled with a dracut module.

- I could create two different unit files mydaemon.service and
mydaemon-initrd.service and have them conflict. dracut doesn't support
this out of the box. A separate dracut module would be necessary, too.

- Some settings such as KillMode=none make sense for the service in the
initrd environment, but not for the one running in the root FS, and
vice versa. This is another argument for having separate unit files, or
initrd-specific drop-ins.

Bottom line for 4) is that a dracut module specific to the daemon at
hand must be written. That dracut module would need to figure out
whether the service is required for mounting root, and activate "root-
storage-daemon" mode by adding the service to the initrd. The instance
in the root FS would then either need to be disabled, or be smart enough
to detect the situation and exit gracefully. Ideally, "systemctl status"
would show the service as running even though the instance inside the
root FS isn't actually running. I am unsure whether all this can be
achieved easily with the current systemd functionality; please advise.

I hope this makes at least some sense.

Suggestions and feedback welcome.

Regards
Martin



