Hi!
I just realized: this may well be fixed by Shepherd commit
a5d9a41aa0bfadea9617df4411b86c6ea0c525bd (see
<https://issues.guix.gnu.org/77373>.) Would be great to give it a try
somehow.
Anyway, some more data.
Yesterday I ran ‘herd stop nginx’ on bayfront (which was running 1.0.2),
again causing ‘herd status’ to hang afterwards. This time I captured an
strace log of PID 1.
The /var/log/messages showing what happens after ‘herd stop nginx’
--8<---------------cut here---------------start------------->8---
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service
certbot-certificate-renewal...
2025-04-08 11:28:27 localhost shepherd[1]: Terminating timer
'certbot-certificate-renewal' with 0 processes running.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal
stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal
is now stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service nginx...
2025-04-08 11:28:27 localhost shepherd[1]: Registering new logger for nginx.
2025-04-08 11:28:28 localhost shepherd[1]: [nginx] nginx: [alert] could not
open error log file: open()
"/gnu/store/w2pi23rrk1dy95rzpb9l87gcj3c4jb38-nginx-1.27.3/logs/error.log"
failed (2: No such file or directory)
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn]
12958#0: duplicate MIME type "text/html" in
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn]
12958#0: duplicate MIME type "text/html" in
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1107
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn]
12958#0: duplicate MIME type "text/html" in
/gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:4100
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn]
12958#0: conflicting server name "bayfront.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx might have failed to
stop.
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn]
12958#0: conflicting server name "logs.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx is now stopped.
2025-04-08 11:28:32 localhost shepherd[1]: Accepted connection on 0.0.0.0:22
from 101.35.215.239:37996.
2025-04-08 11:29:40 localhost shepherd[1]: 2 connections still in use after
sshd-25913 termination.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 (PID 12257)
exited with 3.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 has been disabled.
2025-04-08 11:32:37 localhost shepherd[1]: Accepted connection on [::]:22 from
[2a0c:e300::58]:40428.
--8<---------------cut here---------------end--------------->8---
(The “might have failed to stop” bit is harmless and fixed in
3f4438741615aa1a1198f9415be8bd110138d83e.)
On berlin, which does not exhibit the bug, we get something similar (for
‘herd restart nginx’ here):
--8<---------------cut here---------------start------------->8---
2025-04-07 16:11:25 localhost shepherd[1]: Stopping service certbot-renewal...
2025-04-07 16:11:25 localhost shepherd[1]: Terminating timer 'certbot-renewal'
with 0 processes running.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal stopped.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal is now
stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Stopping service nginx...
2025-04-07 16:11:26 localhost shepherd[1]: Registering new logger for nginx.
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] nginx: [alert] could not
open error log file: open()
"/gnu/store/40mb76jc9qbm0d36qyncnpgrgsfab0db-nginx-1.27.3/logs/error.log"
failed (2: No such file or directory)
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [warn]
27359#0: duplicate MIME type "text/html" in
/gnu/store/s2yrd68wr9cd2iirc900491qgigyfg0z-nginx.conf:3593
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [notice]
27359#0: signal process started
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx is now stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/bootstrappable.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/issues-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/qualif.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/disarchive.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/bootstrappable.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/dump-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/workflows-guix-info.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/workflows-guix-info.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Starting service user-homes...
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes has been started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes running with
value #t.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning
anonip-/var/log/anonip/https.access.log.
--8<---------------cut here---------------end--------------->8---
Unfortunately the strace log isn’t very helpful:
--8<---------------cut here---------------start------------->8---
11:28:27 read(54, "(shepherd-command (version 0) (action stop) (service nginx)
(arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 122
<0.000162>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=326773810})
= 0 <0.000043>
11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327037318})
= 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327207449})
= 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000037>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327433802})
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327698776})
= 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000036>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327923662})
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328114326})
= 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328338724})
= 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=355846206})
= 0 <0.000069>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=38, u64=38}}], 16, 0) = 1
<0.000045>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=356163002})
= 0 <0.000041>
11:28:27 read(38, "G", 1) = 1 <0.000049>
11:28:27 read(38, "E", 1) = 1 <0.000038>
11:28:27 read(38, "T", 1) = 1 <0.000038>
[…]
11:28:27 write(7, "Stopping service certbot-certificate-renewal...\n", 48) = 48
<0.000052>
[…]
11:28:27 write(25, "2025-04-08 11:28:27 localhost shepherd[1]: Service
certbot-certificate-renewal stopped.\n", 88) = 88 <0.000049>
[…]
11:28:27 write(24, "shepherd[1]: Stopping service nginx...\n", 39) = 39
<0.000050>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560464210})
= 0 <0.000045>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=17, u64=17}}], 16, 0) = 1
<0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560715007})
= 0 <0.000042>
11:28:27 getuid() = 0 <0.000052>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=687926724})
= 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=688178499})
= 0 <0.000037>
11:28:27 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8)
= 0 <0.000055>
11:28:27 pipe2([56, 59], O_CLOEXEC) = 0 <0.000060>
11:28:27 fcntl(56, F_GETFL) = 0 (flags O_RDONLY) <0.000038>
11:28:27 lseek(56, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) <0.000037>
11:28:27 fcntl(59, F_GETFL) = 0x1 (flags O_WRONLY) <0.000036>
11:28:27 lseek(59, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) <0.000036>
11:28:27 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:27 clone(child_stack=NULL,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f01e6e99850) = 12958 <0.008960>
11:28:27 fcntl(56, F_GETFL) = 0 (flags O_RDONLY) <0.000040>
11:28:27 fcntl(56, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000039>
11:28:27 close(59) = 0 <0.000039>
11:28:27 rt_sigprocmask(SIG_SETMASK, [HUP INT TERM CHLD], [HUP INT TERM CHLD],
8) = 0 <0.000038>
11:28:27 recvfrom(17, "shepherd[1]: Stopping service nginx...\n", 1024, 0,
0x7ffc10c01010, [112 => 0]) = 39 <0.000060>
[…]
11:28:27 read(56, 0x7f01d9a634d0, 1) = -1 EAGAIN (Resource temporarily
unavailable) <0.000040>
11:28:27 epoll_ctl(11, EPOLL_CTL_MOD, 56,
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = -1 ENOENT
(No such file or directory) <0.000041>
11:28:27 epoll_ctl(11, EPOLL_CTL_ADD, 56,
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = 0 <0.000043>
11:28:27 write(7, "Registering new logger for nginx.\n", 34) = 34 <0.000048>
[…]
11:28:28 read(56, "n", 1) = 1 <0.000049>
11:28:28 read(56, "g", 1) = 1 <0.000046>
11:28:28 read(56, "i", 1) = 1 <0.000042>
11:28:28 read(56, "n", 1) = 1 <0.000040>
11:28:28 read(56, "x", 1) = 1 <0.000039>
11:28:28 read(56, ":", 1) = 1 <0.000039>
11:28:28 read(56, " ", 1) = 1 <0.000062>
11:28:28 read(56, "[", 1) = 1 <0.000039>
11:28:28 read(56, "a", 1) = 1 <0.000042>
11:28:28 read(56, "l", 1) = 1 <0.000042>
11:28:28 read(56, "e", 1) = 1 <0.000041>
11:28:28 read(56, "r", 1) = 1 <0.000044>
11:28:28 read(56, "t", 1) = 1 <0.000064>
11:28:28 read(56, "]", 1) = 1 <0.000043>
[…]
11:28:29 write(7, "[", 1) = 1 <0.000048>
11:28:29 write(7, "nginx", 5) = 5 <0.000035>
11:28:29 write(7, "] ", 2) = 2 <0.000034>
11:28:29 write(7, "2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type
\"text/html\" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043",
130) = 130 <0.000036>
11:28:29 write(7, "\n", 1) = 1 <0.000036>
[…]
11:28:32 accept4(57, {sa_family=AF_INET, sin_port=htons(37996),
sin_addr=inet_addr("101.35.215.239")}, [112 => 16], SOCK_CLOEXEC) = 54
<0.000058>
11:28:32 write(7, "Accepted connection on 0.0.0.0:22 from
101.35.215.239:37996.\n", 61) = 61 <0.000050>
11:28:32 fcntl(54, F_GETFL) = 0x2 (flags O_RDWR) <0.000037>
11:28:32 fcntl(54, F_SETFL, O_RDWR) = 0 <0.000037>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512507676})
= 0 <0.000037>
11:28:32 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:32 epoll_wait(11, [{events=EPOLLIN, data={u32=6, u64=6}}], 16, 273) = 1
<0.000039>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512933004})
= 0 <0.000037>
[…]
11:28:52 accept4(16, 0x7ffc10c01010, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1
EAGAIN (Resource temporarily unavailable) <0.000057>
11:28:52 epoll_ctl(11, EPOLL_CTL_MOD, 16,
{events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=16, u64=16}}) = 0 <0.000042>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448197354})
= 0 <0.000037>
11:28:52 epoll_wait(11, [], 16, 0) = 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448434462})
= 0 <0.000036>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444,
st_size=2962, ...}, 0) = 0 <0.000046>
11:28:52 write(41, "2025-04-08 11:28:52 39:4 17 (call-with-sigint _ _)\n",
56) = 56 <0.000105>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444,
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(41, "2025-04-08 11:28:52 In fibers.scm:\n", 35) = 35 <0.000058>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444,
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: ludo : TTY=pts/9 ;
PWD=/home/ludo/src/maintenance/hydra ; USER=root ;
COMMAND=/run/current-system/profile/bin/herd status nginx\n", 168) = 168
<0.000062>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450211551})
= 0 <0.000039>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450387060})
= 0 <0.000039>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444,
st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: pam_unix(sudo:session):
session opened for user root(uid=0) by (uid=30012)\n", 111) = 111 <0.000051>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450934120})
= 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=451064163})
= 0 <0.000059>
11:28:52 read(59, "(shepherd-command (version 0) (action status) (service
nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024)
= 124 <0.000084>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452217926})
= 0 <0.000040>
11:28:52 epoll_wait(11, [], 16, 0) = 0 <0.000044>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452460901})
= 0 <0.000039>
--8<---------------cut here---------------end--------------->8---
At that point it was still accepting connections on
/var/run/shepherd/socket and reading them, reading process output and
logging it, but not much else.
For example, the “Accepted connection on …:22” messages do not have a
corresponding ‘clone’ call and not even “Starting service sshd-1234...”
as we’d normally see right after.
Ludo’.