Hi, Maxim Cournoyer <maxim.courno...@gmail.com> skribis:
>>> + (info (G_ "Testing ~a build machines defined in '~a'...~%") >>> (length machines) machine-file) >>> - (let* ((names (map build-machine-name machines)) >>> - (sockets (map build-machine-daemon-socket machines)) >>> - (sessions (map (cut open-ssh-session <> %short-timeout) >>> machines)) >>> - (nodes (map remote-inferior sessions))) >>> - (for-each assert-node-has-guix nodes names) >>> - (for-each assert-node-repl nodes names) >>> - (for-each assert-node-can-import sessions nodes names sockets) >>> - (for-each assert-node-can-export sessions nodes names sockets) >>> - (for-each close-inferior nodes) >>> - (for-each disconnect! sessions)))) >>> + (par-for-each check-machine-availability machines))) >> >> Why not! IMO this should go in a separate patch, though, since it’s not >> related. > > For me, it is related in that retrying all the checks of *every* build > offload machine would be too expensive; it already takes 32 s for my 4 > offload machines; retrying this for up to 3 times would mean waiting for > a minute and half, which I don't find reasonable (imagine on berlin!). I see. So I’d say it’s a prerequisite (a patch that must come before) but not entirely the same thing. I’m nitpicking! We should make sure it doesn’t trigger thread-safety issues in libssh or anything like that (running it repeatedly on a large machines.scm should give us some confidence). >>> +(define (check-machine-availability machine) >>> + "Check whether MACHINE is available. Exit with an error upon failure." >>> + ;; Sometimes, the machine remote port may return EOF, presumably because >>> the >>> + ;; connection was lost. Retry up to 3 times. >>> + (let loop ((retries 3)) >>> + (guard (c ((inferior-connection-lost? c) >>> + (let ((retries-left (1- retries))) >>> + (if (> retries-left 0) >>> + (begin >>> + (format (current-error-port) >>> + (G_ "connection to machine ~s lost; >>> retrying~%") >>> + (build-machine-name machine)) >>> + (loop (retries-left))) >>> + (leave (G_ "connection repeatedly lost with machine >>> '~a'~%") >>> + (build-machine-name machine)))))) >> >> I’m afraid we’re papering over problems here. > > I had that thought too, but then also realized that even if this was > papering over a problem, it'd be a good one to paper over as this > problem can legitimately happen in practice, due to the network's > inherently shaky nature. It seems better to be ready for it. Also, my > hopes in being able to troubleshoot such a difficult to reproduce > networking issue are rather low. Yes, but note that this is just for ‘guix offload test’. The actual code run while offloading will still fail badly. >> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on >> berlin enough to reproduce the issue? If so, we could monitor/strace >> sshd on overdrive1 to get a better understanding of what’s going on. > > It's actually difficult to trigger it; it seems to happen mostly on the > first try after a long time without connecting to the machine; on the > 2nd and later tries, everything is smooth. Waiting a few minutes is not > enough to re-trigger the problem. > > I've managed to see the problem a few lucky times with: > > while true; do guix offload test /etc/guix/machines.scm overdrive1; done > > I don't have a password set for my user on overdrive1, so can't attach > strace to sshd, but yeah, we could try to capture it and see if we can > understand what's going on. OK. > From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001 > From: Maxim Cournoyer <maxim.courno...@gmail.com> > Date: Tue, 25 May 2021 08:42:06 -0400 > Subject: [PATCH] offload: Handle a possible EOF response from > read-repl-response. > > Fixes <https://issues.guix.gnu.org/41625>. > > * guix/scripts/offload.scm (check-machine-availability): Refactor so that it > takes a single machine object, to allow for retrying a single machine. Handle > the case where the checks raised an exception due to the connection to the > build machine having been lost, and retry up to 3 times. Ensure the cleanup > code is run in all situations. > (check-machines-availability): New procedure. Call > CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about > twice as fast with 4 build machines, from ~30 s to ~15 s). > * guix/inferior.scm (&inferior-connection-lost): New condition type. > (read-repl-response): Raise a condition of the above type when reading EOF > from the build machine's port. [...] > +(define-condition-type &inferior-connection-lost &error > + inferior-connection-lost?) Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would allow the handler to present more information as to which inferior is failing. Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’. > + (format (current-error-port) > + (G_ "connection to machine '~a' lost; > retrying~%") > + (build-machine-name machine)) You can use ‘info’ instead of ‘format’. Otherwise LGTM, thanks! Ludo’.