Hi Ludo,

That is not a fix.  It's a workaround for now.

It's good that the "is a shepherd already running" check is back in shepherd.  
It was in shepherd years ago, then got removed without explanation, then now 
it's back again (now in a very convoluted but safer way).  This shouldn't have 
been removed in the first place.  It's EXTREMELY dangerous to have multiple 
parallel shepherds for the same user (automated backup service destroying 
backups etc).  Please, let's not remove it ever again.

In any case, what shepherd 1.0.4 does is stop the bleeding, but not fix the 
problem:
It prevents two (or 100) user shepherds for the same user from running in 
parallel.
It does not stop shepherd when a user closed all their sessions.

Why close this bug report before elogind is patched and before ~/.bash_logout 
is generated in guix home?  That makes no sense.

Also, I don't understand why this is so broken for so long.  Isn't Guix used in 
HPC?
Doesn't HPC need support for multiple sessions for the same user on day one?

My untested elogind patch that invokes shepherd root stop is attached.  Reading 
the elogind source code, especially what they patched out and what they added 
themselves, makes me despair.  Why is it so terrible?  That all used to be 
fine! :P

Even my patch is not great.  A service manager's job is to manage services.  
PID 1 is the main service manager.  It should manage services.  One of those 
services should be the user's shepherd, which should be managed by PID 1 
shepherd and not weirdly attached to an already-running session (WTF!) of the 
user by this:

~$ cat ~/.profile
HOME_ENVIRONMENT=$HOME/.guix-home
. $HOME_ENVIRONMENT/setup-environment
$HOME_ENVIRONMENT/on-first-login
unset HOME_ENVIRONMENT

In my opinion, no one but the service manager should manage services.  Does 
~/.profile look like a service manager?  No :P

I understand that we want to support this on non-guix-system stuff.  But the 
default should be a systemd user service to run the user shepherd.  If the user 
absolutely wants to do a workaround like ~/.profile above, fine, they can.  But 
let's not do that by default.

The problems with my elogind patch are the following:
- What if "herd stop root -s ..." hangs?  Then elogind hangs forever?  No one 
can log in or out anymore?  That's not okay.  Therefore, I don't wait.  Now 
user processes can have the floor upon they are walking removed on user stop, 
while they still need it :P
- When can /run/user/1000 be deleted?  There's a weird GC mechanism in elogind 
for that, and my patch says it can be deleted before waiting on the result of 
herd stop (see above why).  If I DID wait on the result of herd stop, I could 
wait indefinitely--which is not okay.  I think elogind uses signalfd, so I 
can't waitpid in a random spot either, or wait until waitpid returned.  I think 
the user shepherd knows when to delete /run/user/1000--and no one else.  But if 
user shepherd crashes, it won't delete /run/user/1000 and we want it to be able 
to start again even when /run/user/1000 is still there.  Hence complicated 
shepherd fix in 1.0.4 is useful.
- There is tool_fork_pid and sleep_fork_pid in elogind which is not a queue.  
And, again, that is trying to be a service manager.  What if those scripts 
hang?  What if they DON'T hang?  Similar questions as before.  Separate the 
concerns already :P

Personally, I'd also like something that, if all sessions of user x are closed, 
it kills all remaining processes of that effective user id.  elogind has a 
setting KillUserProcesses that--despite the name--kills (WHICH!?) processes 
when a SESSION (of 42 sessions of that user :P) is closed.  Who wants THAT?  
And even if someone does: how would THAT be implemented?

elogind is like containers never happened.  It's so weird.

I think to fix this problem for good, first there needs to be a system diagram 
created on how this is supposed to work.

License: elogind's license
Author: Danny Milosavljevic <[email protected]>
Date: 2025-05-18

diff -ru orig/18rk21n7l3yniy1rvlcdnwgnnvafivf0-elogind-255.17-checkout/src/login/logind-user.c 18rk21n7l3yniy1rvlcdnwgnnvafivf0-elogind-255.17-checkout/src/login/logind-user.c
--- orig/18rk21n7l3yniy1rvlcdnwgnnvafivf0-elogind-255.17-checkout/src/login/logind-user.c	2025-05-10 13:54:28.999814332 +0200
+++ 18rk21n7l3yniy1rvlcdnwgnnvafivf0-elogind-255.17-checkout/src/login/logind-user.c	2025-05-10 15:48:33.872775240 +0200
@@ -2,6 +2,7 @@
 
 #include <errno.h>
 #include <unistd.h>
+#include <spawn.h>
 
 #include "alloc-util.h"
 //#include "bus-common-errors.h"
@@ -17,6 +18,7 @@
 #include "format-util.h"
 #include "fs-util.h"
 #include "hashmap.h"
+#include "string-util.h"
 // #include "label-util.h"
 #include "limits-util.h"
 #include "logind-dbus.h"
@@ -506,24 +508,45 @@
         return 0;
 }
 
-#if 0 /// elogind does not support user services and systemd units
 static void user_stop_service(User *u, bool force) {
-        _cleanup_(sd_bus_error_free) sd_bus_error error = SD_BUS_ERROR_NULL;
-        int r;
-
-        assert(u);
-        assert(u->service);
-
-        /* The reverse of user_start_service(). Note that we only stop [email protected] here, and let StopWhenUnneeded=
-         * deal with the slice and the [email protected] instance. */
-
-        u->service_job = mfree(u->service_job);
-
-        r = manager_stop_unit(u->manager, u->service, force ? "replace" : "fail", &error, &u->service_job);
-        if (r < 0)
-                log_warning_errno(r, "Failed to stop user service '%s', ignoring: %s", u->service, bus_error_message(&error, r));
+	assert(u);
+	if (u->runtime_path != NULL) {
+		pid_t pid;
+		/* TODO: maybe just /run/booted-system/profile/bin/pkill -u u->user_record->uid ;
+		TODO: maybe just loginctl kill-user x; maybe that's us.
+		That eventually calls user_kill, which elogind patched to not kill the user
+		service or, really, do anything useful.
+		u->slice would be the unit name if it worked.
+		Note: u->user_record->kill_processes is for sessions, not users.
+		See also user_unit_active. */
+		const char *executable = "/run/booted-system/profile/bin/herd";
+		char* socket_path_arg = strjoina(u->runtime_path, "/shepherd/socket");
+		char *argv[] = {
+			(char *) executable,
+			"stop",
+			"root",
+			"-s",
+			socket_path_arg,
+			NULL,
+		};
+		int spawn_status = posix_spawn(&pid, executable, NULL, NULL, argv, environ);
+		if (spawn_status != 0) {
+			log_error_errno(spawn_status, "Failed to invoke %s: %m", executable);
+		} else {
+			if (u->manager != NULL) {
+				/* TODO: Do we overwrite someone here? */
+				u->manager->tool_fork_pid = pid;
+				/* elogind_sigchld_handler unsets it.  Not sure how we'd notice.
+				Note: elogind patched out the service_job handling which means
+				that user_may_gc will return true as soon as u->stopping == true
+				instead of checking whether the user service is still running. */
+			} else {
+				/* ??? */
+			}
+		}
+	}
+	user_add_to_gc_queue(u);
 }
-#endif // 0
 
 int user_stop(User *u, bool force) {
         int r = 0;
@@ -552,11 +575,7 @@
                         r = k;
         }
 
-#if 0 /// elogind does not support service or slice jobs...
         user_stop_service(u, force);
-#else // 0
-        user_add_to_gc_queue(u);
-#endif // 0
 
         u->stopping = true;
 

Reply via email to