Hello, Ian Jackson [11/Jan 11:50am GMT] wrote: > I've looked in the user journal and attached the part surrounding this > failure. I'll now quote and comment on some of the events in the log: > > Jan 10 11:35:50 tag2upload-oracle-01 tag2upload-oracled[2473024]: > [t2u-oracled tag2upload-builder-01.debian.org,2473024][2026-01-10T11:35:50] > WARNING: builder reboot lock script failed with error exit status 1 at line > 459 > > There are a lot of these in the log. They seem to be normal. > Sean, we probably want to fix whatever this is. [ACTION]
This comes from when we close stdin to the 'read l' and then try to reap the process. It is exiting 1 at that point. POSIX and bash's read say they exit non-zero if they encounter EOF. So instead of closing STDIN, maybe we can just print an empty line to it, and then waitpid exactly as before? Or maybe we could change the scriplet to 'read l ||:' ? I think I prefer the former. > Jan 10 12:41:21 tag2upload-oracle-01 tag2upload-oracled[2556368]: > [t2u-oracled tag2upload-builder-01.debian.org,2556368][2026-01-10T12:41:21] > instance fidelity=production > Jan 10 12:41:25 tag2upload-oracle-01 tag2upload-oracled[2556397]: > autopkgtest-virt-podman [12:41:25]: disabling init based on image label > > Here, oracled starts up one worker. That worker "openw" the > autopkgtest-virt-podman "testbed". It's a helpful coincidence that > the way we're using autopkgtest-virt-podman causes this messasge, > because otherwise it might be hard to see when oracled opens the > testbed. We probalby want to log something ourselves. [ACTION] Yes, queued up a fix for this. > Now comes the job where things broke: > > Jan 10 13:46:04 tag2upload-oracle-01 tag2upload-oracled[2556396]: > [t2u-oracled tag2upload-builder-01.debian.org,2556396][2026-01-10T13:46:04] > job=2390 last_attempt= package=libnginx-mod-http-cache-purge > tag=debian/1%2.5.5-1 url=https://salsa.debian.org/nginx-team/lib > nginx-mod-http-cache-purge.git starting > > It's received. > > Jan 10 13:46:04 tag2upload-oracle-01 tag2upload-oracled[2556396]: > [t2u-oracled tag2upload-builder-01.debian.org,2556396][2026-01-10T13:46:04] > worker: invoking <<dgit-repos-server ... > > We run dgit-repos-server. dgit-repos-server started up and > immediately tried to prepare the work and bpd directories. > It failed with the error message we see above. > > Based on events later in the log, I think this must be because the > host is in the process of rebooting and the running containers have > already been shut down. (autopkgtest-virt-podman open will have > started a running container at 12:41.) > > Jan 10 13:46:09 tag2upload-oracle-01 tag2upload-oracled[2565802]: job=2390: > failed, reported by email > Jan 10 13:46:09 tag2upload-oracle-01 tag2upload-oracled[2565802]: job=2390 > final-disposition=irrecoverable > > Here oracled is reporting the failure. > > Jan 10 13:46:10 tag2upload-oracle-01 tag2upload-oracled[2556396]: > [t2u-oracled tag2upload-builder-01.debian.org,2556396][2026-01-10T13:46:10] > WARNING: builder reboot lock script failed with error exit status 1 at line > 459 > > Here's this again, which I think is normal and therefore a red herring. > > Jan 10 13:46:12 tag2upload-oracle-01 tag2upload-oracled[2565814]: Error: can > only create exec sessions on running containers: container state improper > Jan 10 13:46:12 tag2upload-oracle-01 tag2upload-oracled[2556396]: > [t2u-oracled tag2upload-builder-01.debian.org,2556396][2026-01-10T13:46:12] > WARNING: failed to remove > /srv/builder.tag2upload.debian.org/tmp/autopkgtest-virt-docker.shared.pwh6qyag/downtmp > in builder v > irt: subprocess failed with error exit status 255 at line 366 > Jan 10 13:46:12 tag2upload-oracle-01 tag2upload-oracled[2556397]: Error: can > only create exec sessions on running containers: container state improper > Jan 10 13:46:14 tag2upload-oracle-01 tag2upload-oracled[2565815]: > autopkgtest-virt-podman [13:46:14]: disabling init based on image label > > I'm not sure precisely why we see these messages. I'm not sure who is > trying to access the running container. The message about failing to > remove isprobably autopkgtest-virt-podman itself (since oracled is > mostly crash-only). (no, that's from oracled) > I'm not sure why we get another "disabling init". It seems to print > that when we "open" but if so the messages are in a funny order. The worker doesn't distinguish this situation from a failed job, so it goes ahead and gets ready for a new job. The builder virt is reopened before trying to take reboot locks (because opening the builder virt is not within the critical phase). We use 'rm -rf' to try to remove the temporary directory, because we think it may have already been removed. But that means the only way it can fail is if we're prematurely lost access to run commands, right? So shall we upgrade that from a warning to a failure? > But, I'm not sure it matters precisely who is doing what here. These > messages seem consistent with my prior supposition about the podman > state. Yes. > Jan 10 13:53:44 tag2upload-oracle-01 tag2upload-oracled[2556368]: > [t2u-oracled tag2upload-builder-01.debian.org,2556368][2026-01-10T13:53:44] > group_leader: received SIGTERM; shutting down workers > Jan 10 13:53:44 tag2upload-oracle-01 systemd[2556306]: Stopping > tag2upload-oracled.service - tag2upload Oracle daemon... > Jan 10 13:53:44 tag2upload-oracle-01 systemd[2556306]: Stopped > tag2upload-oracled.service - tag2upload Oracle daemon. > -- Boot cbbd32cac2974b5e901921187e477fa7 -- > > This is the host rebooting. > > Jan 10 13:55:35 tag2upload-oracle-01 systemd[706]: Started > tag2upload-oracled.service - tag2upload Oracle daemon. > Jan 10 13:55:39 tag2upload-oracle-01 tag2upload-oracled[892]: [t2u-oracled > tag2upload-builder-01.debian.org,892][2026-01-10T13:55:39] instance > fidelity=production > Jan 10 13:55:45 tag2upload-oracle-01 tag2upload-oracled[1262]: > autopkgtest-virt-podman [13:55:45]: disabling init based on image label > Jan 10 13:55:50 tag2upload-oracle-01 tag2upload-oracled[1357]: System is > going down. Unprivileged users are not permitted to log in anymore. For > technical details, see pam_nologin(8). > Jan 10 13:55:50 tag2upload-oracle-01 tag2upload-oracled[1357]: Connection > closed by 2607:f8f0:614:1::1274:76 port 22 > > This is during our startup. The IPv6 saddress is that of the > manager. Evidently it is being rebooted at the same time. > (We did see a manager reboot around this time.) > > We make a number of attempts. Eventually: > > Jan 10 13:58:34 tag2upload-oracle-01 tag2upload-oracled[1788]: > autopkgtest-virt-podman [13:58:34]: disabling init based on image label > > Here we manage to start up. Note that there is no log message saying > we've established a managaer connection! [ACTION] Queued up. > Jan 10 14:12:29 tag2upload-oracle-01 tag2upload-oracled[1788]: Connection to > tag2upload-builder-01.debian.org closed by remote host. > Jan 10 16:36:21 tag2upload-oracle-01 tag2upload-oracled[892]: [t2u-oracled > tag2upload-builder-01.debian.org,892][2026-01-10T16:36:21] group_leader > worker=1787: died due to fatal signal PIPE > > IHNI what these are. They are probably related. Any ideas? Alas no. > Jan 10 16:36:43 tag2upload-oracle-01 tag2upload-oracled[19678]: > autopkgtest-virt-podman [16:36:43]: disabling init based on image label > Jan 10 16:36:53 tag2upload-oracle-01 tag2upload-oracled[19677]: [t2u-oracled > tag2upload-builder-01.debian.org,19677][2026-01-10T16:36:53] job=2391 > last_attempt= package=libnginx-mod-http-cache-purge tag=debian/1%2.5.5-1 > url=https://salsa.debian.org/nginx-team/libnginx-mod-http-cache-purge.git > starting > Jan 10 16:36:53 tag2upload-oracle-01 tag2upload-oracled[19677]: [t2u-oracled > tag2upload-builder-01.debian.org,19677][2026-01-10T16:36:53] worker: invoking > <<dgit-repos-server ... > > Here comes a job. My theory about the previous messages is: > - At 14:12 the connection to the builder was terminated for unknown > reasons. But we didn't notice because we weren't reading from that > ssh, only from the manager. > - At 16:36 a job comes into the system. The manager sends us an ayt > and we try to do a `capabilities` to the testbed. It's dead > so we die with SIGPIPE. I don't understand why we would die with SIGPIPE instead of the print in ProtoConn::send_raw failing and us going via confess? > - oracled parent starts a new worker. There is no log message for > this [ACTION] but we can see the "disabling init" from it > opening the testbed. Queued up. > Tentative conclusion: our DSA reboot lock arrangements interacted > insufficiently with the container system shutdown. It seems like one > of the following happened: > > A. The timing of us taking the reboot lock is wrong, > so that it doesn't assure viability of the container. > (That would be our bug.) > > B. We were able to take the reboot lock despite the system > already having shut down our container and/or > the system shut down our container despite us holding > the reboot lock. > (We would need DSA to fix it.) > > C. Our reboot lock arrangements are somehow incompatible with DSA's > and the two systems just bypass each other. (To detect this we > might need to ask DSA to do a reboot with us watching, or > something,) I can't think of a way in which (A) could happen. It would have to be that the container is already doomed when we successfully take our lock. But that would imply that whatever has doomed the container has failed to attempt to take an exclusive lock before doing anything else? -- Sean Whitton
signature.asc
Description: PGP signature

