Hello Adam,

Thank you for a very helpful message. I'm quoting more than I'm
replying to in order to copy your message to the BTS.

Adam D. Barratt [17/Jan 2:46pm GMT] wrote:
> On Sat Jan 17 14:02:53 2026, [email protected] wrote:
>> Hello,
>>
>> Aurelien Jarno [17/Jan 12:36pm GMT] wrote:
>> > I haven't looked at all the details, but here are a few things from
>> > the logs.
>> > The reboot of tag2upload-builder-01 was scheduled at 14:12:29. It
>> > indeed caused a podman container to be stopped:
> [...]
>> > Could you please confirm from your logs that the reboot lock was
>> > indeed taken by your tag2upload job?
>>
>> It doesn't print anything if it successfully takes the lock, but it
>> prints something and exits if it fails to take the lock (verified by
>> our test suite), and the logs indicate it did not exit. So, yes, I can
>> confirm that the job did indeed take the lock.
>
> Looking through the log of #1125239, I think some of the timings have
> been confused, so it would be worth checking the process flow.

Hrm, yes. The Podman error comes much earlier than 14:12. So possibly
that Podman error is a completely unrelated bug in Podman. It may have
been introduced by the upgrade to trixie.

Ian, what are your thoughts on this?

> | Jan 10 13:53:44 tag2upload-oracle-01 tag2upload-oracled[2556368]:
> | [t2u-oracled tag2upload-builder-01.debian.org,2556368][2026-01-10T13:53:44]
> | group_leader: received SIGTERM; shutting down workers
> | Jan 10 13:53:44 tag2upload-oracle-01 systemd[2556306]: Stopping
> tag2upload-oracled.service - tag2upload Oracle daemon...
> | Jan 10 13:53:44 tag2upload-oracle-01 systemd[2556306]: Stopped
> tag2upload-oracled.service - tag2upload Oracle daemon.
> | -- Boot cbbd32cac2974b5e901921187e477fa7 --
> |
> | This is the host rebooting.
>
> In fact, it's not - as Aurelien noted, the reboot was at 14:12. The messages
> above are likely our upgrade scripts noticing that systemd user services
> needed restarting due to using old versions of libraries that were upgraded
> under them, and thus restarting them. Maybe we need to make that restart less
> broad somehow.
>
> (In theory all service restarts are skipped if the machine needs a reboot, but
> that's sometimes bypassed as there can be a longer delay before the reboot
> occurs in some cases. We could just skip the user service restart in such
> cases.)

We have TimeoutStopSec=2000 in our systemd unit, and tag2upload-oracled
finishes up jobs before exiting in response to a SIGTERM, so unless your
restart script is overriding that TimeoutStopSec (please let us know if
that's the case!), service restarts should not be an issue in
themselves.

> | Jan 10 14:12:29 tag2upload-oracle-01 tag2upload-oracled[1788]: Connection
> to tag2upload-builder-01.debian.org closed by remote host.
> | Jan 10 16:36:21 tag2upload-oracle-01 tag2upload-oracled[892]: [t2u-oracled
> | tag2upload-builder-01.debian.org,892][2026-01-10T16:36:21] group_leader
> | worker=1787: died due to fatal signal PIPE
> |
> | IHNI what these are. They are probably related. Any ideas?
>
> No idea on the second, but the first is the tag2upload-builder-01 reboot.
>
> FWIW the relevant part of our reboot scripts is:
>
> screen -S reboot-job -d -m sh -c 'sleep ${minwait}m ; flock
> /var/run/reboot-lock true; /sbin/shutdown -r 10 \"Kernel (mass) reboot issued
> by `whoami`.\" < /dev/null '
>
> Are you able to determine when you should have taken the lock? If I'm
> reading things correctly, what /could/ happen is:
>
> T: Reboot scripts manage to obtain lock, and schedule reboot for 10 minutes
> later. The lock is only held for the duration of the invocation of "true".
> T+X: tag2upload jobs obtain lock, unaware of the already scheduled reboot
> T+10: Reboot

Okay, so based on this information it looks like we have an
incompatibility between our locking arrangements, regardless of whether
tag2upload job 2390 failed because of a reboot.

In particular, when implementing the locking I had been assuming that
/var/run/reboot-lock would remain locked while a reboot was pending.
But in fact it isn't.
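
To make the difference concrete, here is a minimal shell sketch of the
two behaviours. It is an illustration only, not DSA's script or the
tag2upload code: the lock file is a temporary file rather than
/var/run/reboot-lock, `sleep 3` stands in for the pending-reboot window,
and the job is assumed to make a non-blocking `flock -n` attempt.

```shell
#!/bin/sh
# Hypothetical demo; paths and timings are stand-ins, not the real setup.
lock=$(mktemp)

# Pattern from the quoted reboot script: the lock is held only for the
# duration of `true`, so it is free again before the "reboot" happens.
flock "$lock" true
if flock -n "$lock" true; then
    released_result="job got lock"
else
    released_result="job refused"
fi

# What I had assumed: the lock stays held while the reboot is pending.
# `sleep 3` plays the role of the delay before the machine goes down.
flock "$lock" sleep 3 &
holder=$!
sleep 1   # give the background holder time to acquire the lock
if flock -n "$lock" true; then
    held_result="job got lock"
else
    held_result="job refused"
fi

wait "$holder"
rm -f "$lock"

echo "lock released before reboot: $released_result"
echo "lock held until reboot:      $held_result"
```

With util-linux flock, the first attempt succeeds even though a reboot
would already be scheduled, while the second is refused for as long as
the holder keeps the lock.
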
Communication goes in only one direction -- a program can communicate to
DSA that a reboot would be unwanted, by taking a lock, but DSA cannot
communicate to a program that doing anything critical would be a bad
idea because a reboot is pending.

Would it be possible to rework your scripts to keep the lock held
throughout? Then communication would go in both directions and we
wouldn't lose jobs. We can probably help hack on it if needed.

-- 
Sean Whitton

