Hi,

I have not checked all the details: IIUC, the code of “guix offload” is run by root, so it is not as friendly as usual to debug. :-)

On Fri, 17 Dec 2021 at 16:57, Maxim Cournoyer <[email protected]> wrote:

>> However, I think this behavior was unintentionally lost in
>> efbf5fdd01817ea75de369e3dd2761a85f8f7dd5.  Maxim, WDYT?
>
> I just reviewed this commit, and don't see anywhere where the behavior
> would have changed.  The discarding happens here:

[...]

> previously load could be set to +inf.0.  Now it is a float between 0.0
> and 1.0, with threshold defaulting to 0.6.

My /etc/guix/machines.scm contains only one machine and --max-jobs=0.
Because the machine is unreachable, IIUC, ’node’ is (or should be) false
and ’load’ is thus not involved, I guess.  Indeed, ’report-load’ displays
nothing, and instead I get:

--8<---------------cut here---------------start------------->8---
The following derivation will be built:
   /gnu/store/c1qicg17ygn1a0biq0q4mkprzy4p2x74-hello-2.10.drv
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
waiting for locks or build slots...
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
guix offload: error: failed to connect to 'x.x.x.x': Timeout connecting to x.x.x.x
process 75621 acquired build slot '/var/guix/offload/x.x.x.x:22/0'
  C-c C-c
--8<---------------cut here---------------end--------------->8---

Well, if the machine is not reachable, then ’session’ is false, right?

--8<---------------cut here---------------start------------->8---
@@ -472,11 +480,15 @@ (define (machine-faster? m1 m2)
        (let* ((session (false-if-exception (open-ssh-session best %short-timeout)))
               (node    (and session (remote-inferior session)))
-              (load    (and node (normalized-load best (node-load node))))
+              (load    (and node (node-load node)))
+              (threshold (build-machine-overload-threshold best))
               (space   (and node (node-free-disk-space node))))
+         (when load (report-load best load))
          (when node (close-inferior node))
          (when session (disconnect! session))
-         (if (and node (< load 2.) (>= space %minimum-disk-space))
+         (if (and node
+                  (or (not threshold) (< load threshold))
+                  (>= space %minimum-disk-space))

[...]

              (begin
                ;; BEST is unsuitable, so try the next one.
                (when (and space (< space %minimum-disk-space))
                  (format (current-error-port)
                          "skipping machine '~a' because it is low \
on disk space (~,2f MiB free)~%"
                          (build-machine-name best)
                          (/ space (expt 2 20) 1.)))
                (release-build-slot slot)
                (loop others)))))
--8<---------------cut here---------------end--------------->8---

Therefore, the ’else’ branch is taken and so the code does ’(loop
others)’.  However, I do not see why ’others’ is not empty (there is only
one machine in /etc/guix/machines.scm).

Well, the message «waiting for locks or build slots...» suggests that
something is restarted, and that what we are observing is not that ’loop’
but another one.

On the daemon side, I do not know what ’waitingForAWhile’ and
’lastWokenUp’ mean.

--8<---------------cut here---------------start------------->8---
    /* If we are polling goals that are waiting for a lock, then wake
       up after a few seconds at most. */
    if (!waitingForAWhile.empty()) {
        useTimeout = true;
        if (lastWokenUp == 0)
            printMsg(lvlError, "waiting for locks or build slots...");
        if (lastWokenUp == 0 || lastWokenUp > before) lastWokenUp = before;
        timeout.tv_sec = std::max((time_t) 1, (time_t) (lastWokenUp + settings.pollInterval - before));
    } else lastWokenUp = 0;
--8<---------------cut here---------------end--------------->8---

Bah, it requires more investigation, and I agree with Maxim that
efbf5fdd01817ea75de369e3dd2761a85f8f7dd5 is probably not the issue here.

Cheers,
simon
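
PS: To make the quoted daemon snippet easier to poke at, here is a minimal standalone sketch of it in C++.  It is only a model under assumptions: ’PollState’ and ’nextTimeout’ are hypothetical names, the message is collected in a vector instead of going through ’printMsg’, the poll interval is passed as a parameter rather than read from ’settings’, and nothing else the daemon may do with ’lastWokenUp’ is modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <ctime>
#include <string>
#include <vector>

// Hypothetical stand-in for the daemon state used by the quoted snippet.
struct PollState {
    time_t lastWokenUp = 0;             // 0 means "not currently waiting"
    std::vector<std::string> messages;  // replaces printMsg() in this model
};

// Model of the quoted branch.  Returns the timeout in seconds, or -1 when
// no goal is waiting (the snippet sets no timeout in that branch).
long nextTimeout(PollState& st, bool goalsWaiting, time_t before,
                 time_t pollInterval) {
    if (!goalsWaiting) {
        st.lastWokenUp = 0;  // reset, so the next wait prints the message again
        return -1;
    }
    // The message is emitted only when a waiting episode starts.
    if (st.lastWokenUp == 0)
        st.messages.push_back("waiting for locks or build slots...");
    // Start the episode now (also guards against a clock going backwards).
    if (st.lastWokenUp == 0 || st.lastWokenUp > before)
        st.lastWokenUp = before;
    // Remaining time until the next poll, clamped to at least one second.
    return static_cast<long>(std::max(
        (time_t) 1, (time_t) (st.lastWokenUp + pollInterval - before)));
}
```

With an illustrative poll interval of 5 seconds, calling ’nextTimeout’ at times 100, 103 and 110 while goals are waiting yields timeouts of 5, 2 and 1 seconds, and the message is recorded only on the first call.  That would be consistent with «waiting for locks or build slots...» showing up only once in the log above, though it does not explain the retries.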
