bug#65178: Shepherd stops responding during "guix system reconfigure"

2023-12-15 Thread Timo Wilken
On Fri Dec 15, 2023 at 8:47 PM CET, Attila Lendvai wrote:
> i think i have found the root cause of this, as documented here: 
> https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Thank you very much for this, Attila!

Are the patch in 67839 and/or your branch "attila" linked from there in a
state that I could test them locally? Would it be valuable to you if I ran a
patched Shepherd and sent logs and/or backtraces as I encountered them?





bug#65178: Shepherd stops responding during "guix system reconfigure"

2023-12-15 Thread Attila Lendvai
i think i have found the root cause of this, as documented here: 
https://issues.guix.gnu.org/67839

that issue contains patches for shepherd to reproduce it in its test suite.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“What divides libertarians from everybody else is not a belief about rights or 
what rights people have, because the judgments libertarians make about the 
state are the same as the judgments almost everyone makes about private agents. 
So it's not that we believe in rights that other people don't believe in, or 
that other people believe in rights that we don't believe in. It's that other 
people think the state is exempt from the moral principles that apply to 
non-government agents.”
— Michael Huemer






bug#65178: Shepherd stops responding during "guix system reconfigure"

2023-12-14 Thread Timo Wilken
After a bit of searching, it looks like 67538, 67230 and 65178 may be the same
issue.

Attila Lendvai wrote:
> > > my suspicion is that it's due to some error coming from a start
> > > GEXP that somehow derails shepherd's event loop.
> >
> > iirc I once managed to get a debugger out when it happened and it's
> > stuck waiting in one of the epoll/select/alike calls,
>
> ...or one of the start/stop GEXP's calls something that (sometimes?) blocks
> indefinitely (which violates the API of shepherd).

Same symptoms here again.

For context: this time I was trying to deploy some OCI/Docker containers using
Guix' `oci-container-service-type', specifically a Shepherd service called
"conduit". My code is here:

https://cgit.twilken.net/dotfiles/log/?h=matrix-containers

(Specifically, commits bf94f7872a1df293bd904bbd2c1ef7229f4f98a8 and
c87dcdae79c6266ac3dac70af08fbef5eb21629b.)

This is with Guix commit 1b2505217cf222d98cc960b8510660976a01cfa1.

I first ran "guix system reconfigure -L . tw/system/lud.scm" with commit
bf94f7872a1df293bd904bbd2c1ef7229f4f98a8, which had a bug (an env var was
wrong, so the container failed to start). This worked as expected in that
Shepherd tried to start the service, which failed, so Shepherd disabled it.

Then, I fixed the env var and re-ran "guix system reconfigure -L .
tw/system/lud.scm" with commit c87dcdae79c6266ac3dac70af08fbef5eb21629b.
Shepherd loaded the new "conduit" service fine, as far as I can tell, but
didn't restart it because it was still disabled.

I then enabled and started the service manually. Enabling worked fine, but on
start, I got no terminal output from Shepherd, and it hung.

I still had an error in my setup (directory permissions were wrong), and I got
a message in /var/log/messages to that effect:

--8<---cut here---start->8---
Dec 14 21:33:50 localhost shepherd[1]: Service conduit is currently disabled. 
Dec 14 21:34:04 localhost shepherd[1]: Enabled service conduit. 
Dec 14 21:34:07 localhost shepherd[1]: Starting service user-homes... 
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes has been started. 
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes started. 
Dec 14 21:34:07 localhost shepherd[1]: Service user-homes running with value 
#t. 
Dec 14 21:34:07 localhost shepherd[1]: Starting service conduit... 
Dec 14 21:34:07 localhost shepherd[1]: Service conduit has been started. 
Dec 14 21:34:07 localhost shepherd[1]: Service conduit started. 
Dec 14 21:34:07 localhost shepherd[1]: Service conduit running with value 
13226. 
Dec 14 21:34:07 localhost shepherd[1]: [docker] conduit: [...] "IO error: While 
open a file for appending: /var/lib/matrix-conduit/LOG: Permission denied"
--8<---cut here---end--->8---

...showing that Shepherd had at least tried to start the new container. The
container is not running, though (due to the error shown above), and nothing
with PID 13226 is running.

The "herd start conduit" command did not return, and ^C-ing it did not help.
Afterwards, every "herd" command also hung without any output.

Here are the last four lines of the output of "sudo strace -s1000 herd status"
on such a hung machine:

--8<---cut here---start->8---
connect(10, {sa_family=AF_UNIX, sun_path="/var/run/shepherd/socket"}, 26) = 0
getcwd("/home/timo", 100)   = 11
write(10, "(shepherd-command (version 0) (action status) (service root) 
(arguments ()) (directory \"/home/timo\"))", 101) = 101
read(10,
--8<---cut here---end--->8---

The "read(10, " call never completes.

At least in this case, Shepherd still seems to be processing inbound inet
connections, so I can open new SSH connections to the machine.

Attaching to PID 1 with strace shows it is stuck in "epoll_wait(13, "
(unsurprisingly, fd 13 points to "anon_inode:[eventpoll]"). Here's a backtrace
of all threads in "gdb -p 1":

--8<---cut here---start->8---
(gdb) info threads
  Id   Target Id Frame 
* 1Thread 0x7f786544c380 (LWP 1) "shepherd"  0x7f7865552626 in 
epoll_wait ()
   from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
  2Thread 0x7f7864e16640 (LWP 186) "GC-marker-0" 0x7f78654cf16a in 
__futex_abstimed_wait_common ()
   from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
  3Thread 0x7f7864615640 (LWP 187) "GC-marker-1" 0x7f78654cf16a in 
__futex_abstimed_wait_common ()
   from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
  4Thread 0x7f7863e14640 (LWP 188) "GC-marker-2" 0x7f78654cf16a in 
__futex_abstimed_wait_common ()
   from /gnu/store/ln6hxqjvz6m9gdd9s97pivlqck7hzs99-glibc-2.35/lib/libc.so.6
  5Thread 0x7f78634c6640 (LWP 190) "shepherd"0x7f786554300c in read 
()
   from