> Hi Maxime,
>
> > On Mon, 2021-03-15 at 00:13 +0000, raid5atemyhomework via Bug reports for
> > GNU Guix wrote:
> >
> > > Hello all,
> > > [...]
> > > I recently had to rebuild an OS (because I was dumb; the Guix language
> > > for shepherd services can easily lead you deadlocking shepherd itself)
> > > and had supreme difficulty reinstalling, [...]
> >
> > Reinstalling after a messed up configuration file shouldn't be necessary.
> > At least when using GRUB as bootloader, guix keeps some old (& presumably
> > not broken) system generations around, that can be selected when booting
> > from the bootloader. (I don't recall exactly how the menu is named,
> > maybe ‘Old system generations of $HOSTNAMES?)
>
> Unfortunately I had a long-standing latent bug in my configuration file that
> triggered on a (persistent on-disk) edge case which would cause the shepherd
> process to enter an infinite loop (because the shepherd configuration
> language is Turing-complete enough to allow infinite loops in the first
> place). All the remaining generations (since I didn't like keeping more than
> a dozen, and had recently been excessively tweaking the configuration file)
> had this bug, so I had no way of reverting to an even older generation that
> predated the bug.

And regardless, this kind of problem shouldn't occur in the first place.

* Instead of running the `start` code in the same process 1 (which is
  special enough that no amount of `kill -s SIGKILL 1` will work even if
  you manage to log into a console), `shepherd` should really run it in a
  separate process, monitor whether it is taking too long, and possibly
  allow the operator to break out of it. Principle of least power and all
  that...
* If you want details: shepherd service A is a requirement of shepherd
  service B, but the daemon launched by A needs to reach a particular
  point in its initialization before B can start talking to it. B itself
  will fail to start if A has not reached that point. The extra code I
  added to the `start` of service A waited for that point of
  initialization before A was considered "started". It turned out to be
  buggy: if the point was not reached within 1 second, it would
  inadvertently enter an incorrect loop. Ironically, the logic was
  supposed to exit after 60 seconds, but I got increment/decrement
  crossed, meaning it would keep looping as long as you never reached -60
  seconds, which was impossible. The result was an infinite loop that
  prevented process 1 from advancing. And the initialization point was
  being delayed because the process launched by A had a lot of (important)
  on-disk data to process at startup. That data was persistent and needed
  more than 1 second to process, which ensured that the buggy code would
  be entered on every boot.
* If this had been a new computer, it would have been just as screwed
  during installation anyway; you should consider this a fortuitous
  discovery of a latent bug.
* New users trying out Guix System who happen to get hit by this bug
  might very well decide that Guix is not stable enough for them to
  commit to using.

Thanks
raid5atemyhomework
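
For illustration only, here is a minimal sketch of a bounded wait of the
kind described above. It is written in Python rather than Guile, it is not
the code from the configuration file discussed here, and it is not anything
shepherd provides. The names `wait_until_ready`, `is_ready`, `timeout` and
`poll_interval` are all hypothetical. The idea is to compare against a
monotonic deadline instead of counting up or down, so a crossed
increment/decrement cannot turn the wait into an infinite loop.

```python
import time


def wait_until_ready(is_ready, timeout=60.0, poll_interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or `timeout` seconds elapse.

    Returns True if readiness was observed, False on timeout.
    `is_ready` is a hypothetical caller-supplied check, e.g. "can I
    connect to the daemon's socket yet?".
    """
    # A fixed deadline on a monotonic clock: there is no counter to
    # increment or decrement in the wrong direction.
    deadline = clock() + timeout
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # give up instead of blocking the caller forever
        # Never sleep past the deadline.
        sleep(min(poll_interval, remaining))
```

A caller would treat a `False` result as "service failed to start" rather
than retrying indefinitely, which keeps the worst case at `timeout` seconds.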