Sean Whitton writes ("Re: Bug#1125239: [tag2upload 2390] failed,
libnginx-mod-http-cache-purge 1:2.5.5-1"):
> Ian Jackson [11/Jan 11:50am GMT] wrote:
> > Jan 10 11:35:50 tag2upload-oracle-01 tag2upload-oracled[2473024]:
> > [t2u-oracled tag2upload-builder-01.debian.org,2473024][2026-01-10T11:35:50]
> > WARNING: builder reboot lock script failed with error exit status 1 at line
> > 459
>
> This comes from when we close stdin to the 'read l' and then try to reap
> the process. It is exiting 1 at that point. POSIX and bash's read say
> they exit non-zero if they encounter EOF. So instead of closing STDIN,
> maybe we can just print an empty line to it, and then waitpid exactly as
> before? Or maybe we could change the scriptlet to 'read l ||:'?
> I think I prefer the former.
I agree. If we were using a language where you can sensibly
distinguish EOF from error I'd suggest differently, but this protocol
is private to oracled so it doesn't matter much.
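
For concreteness, a minimal demo of the behaviour (bash at a prompt,
not the real scriptlet):

    # read exits nonzero on EOF even though nothing went wrong:
    printf '' | { read l; echo "read: $?"; }    # prints "read: 1"
    # feeding one empty line instead of closing stdin:
    echo | { read l; echo "read: $?"; }         # prints "read: 0"
    # or tolerating EOF in the scriptlet itself:
    printf '' | { read l ||:; echo "rc: $?"; }  # prints "rc: 0"
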
> > I'm not sure why we get another "disabling init". It seems to print
> > that when we "open" but if so the messages are in a funny order.
>
> The worker doesn't distinguish this situation from a failed job, so it
> goes ahead and gets ready for a new job. The builder virt is reopened
> before trying to take reboot locks (because opening the builder virt is
> not within the critical phase).
Ah.
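So the ordering is roughly this (hypothetical sketch with made-up
helper names, not the real worker code):

    # Sketch only; these helpers do not exist under these names.
    while :; do
        open_builder_virt    # prints "disabling init"; not critical
        take_reboot_lock     # critical phase starts here
        run_job
        release_reboot_lock
    done
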
> > Jan 10 13:46:12 tag2upload-oracle-01 tag2upload-oracled[2556396]:
> > [t2u-oracled tag2upload-builder-01.debian.org,2556396][2026-01-10T13:46:12]
> > WARNING: failed to remove
> > /srv/builder.tag2upload.debian.org/tmp/autopkgtest-virt-docker.shared.pwh6qyag/downtmp
> > in builder v
...
> We use 'rm -rf' to try to remove the temporary directory, because we
> think it may have already been removed. But that means the only way it
> can fail is if we've prematurely lost access to run commands, right?
> So shall we upgrade that from a warning to a failure?
This directory has a pid or something in its path. That implies that
new ones with many different names can be generated. What ensures
that this directory eventually gets deleted, even if everything gets
SIGKILL? And if there *is* such a mechanism, why is it not
sufficient on this codepath too?
Or to put it another way, deleting a directory on cleanup isn't
crash-only and therefore is likely to be unreliable.
And it seems to have a name that implies it's created by the virt
server, so it's presumably the virt system's job to do any cleanup.
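
The crash-only approach would be for whoever owns these directories
to sweep stale ones at startup, not delete them on the way out.
Purely illustrative, with $TMPBASE standing in for the builder's
temp area:

    # Illustrative only; $TMPBASE is a made-up stand-in.
    # Sweep leftovers at startup, when nothing is mid-job:
    find "$TMPBASE" -maxdepth 1 -name 'autopkgtest-virt-*' \
         -mmin +60 -exec rm -rf {} +
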
> > Jan 10 14:12:29 tag2upload-oracle-01 tag2upload-oracled[1788]: Connection
> > to tag2upload-builder-01.debian.org closed by remote host.
> > Jan 10 16:36:21 tag2upload-oracle-01 tag2upload-oracled[892]: [t2u-oracled
> > tag2upload-builder-01.debian.org,892][2026-01-10T16:36:21] group_leader
> > worker=1787: died due to fatal signal PIPE
> >
> > IHNI what these are. They are probably related. Any ideas?
>
> Alas no.
It does look like the builder rebooting.
> > Jan 10 16:36:43 tag2upload-oracle-01 tag2upload-oracled[19678]:
> > autopkgtest-virt-podman [16:36:43]: disabling init based on image label
> > Jan 10 16:36:53 tag2upload-oracle-01 tag2upload-oracled[19677]:
> > [t2u-oracled tag2upload-builder-01.debian.org,19677][2026-01-10T16:36:53]
> > job=2391 last_attempt= package=libnginx-mod-http-cache-purge
> > tag=debian/1%2.5.5-1
> > url=https://salsa.debian.org/nginx-team/libnginx-mod-http-cache-purge.git
> > starting
> > Jan 10 16:36:53 tag2upload-oracle-01 tag2upload-oracled[19677]:
> > [t2u-oracled tag2upload-builder-01.debian.org,19677][2026-01-10T16:36:53]
> > worker: invoking <<dgit-repos-server ...
> >
> > Here comes a job. My theory about the previous messages is:
> > - At 14:12 the connection to the builder was terminated for unknown
> > reasons. But we didn't notice because we weren't reading from that
> > ssh, only from the manager.
> > - At 16:36 a job comes into the system. The manager sends us an ayt
> > and we try to do a `capabilities` to the testbed. It's dead
> > so we die with SIGPIPE.
>
> I don't understand why we would die with SIGPIPE instead of the print in
> ProtoConn::send_raw failing and us going via confess?
The ProtoConn FH is a pipe, onto ssh I think, because it came from a
piped open. We write to a pipe whose other end is closed. That
generates SIGPIPE, whose default action kills the process outright,
so the print never gets as far as returning an error for confess to
see. I don't understand what you don't understand...
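
The mechanism is easy to see from a shell (oracled is Perl, going by
the confess, but the underlying write(2) behaves the same way):

    # Demo: the writer dies of SIGPIPE when the reader goes away;
    # bash reports the signal as exit status 128+13 = 141.
    yes | head -n1 >/dev/null
    echo "${PIPESTATUS[0]}"     # prints 141
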
> > Tentative conclusion: our DSA reboot lock arrangements interacted
> > insufficiently with the container system shutdown. It seems like one
> > of the following happened:
> >
> > A. The timing of us taking the reboot lock is wrong,
> > so that it doesn't assure viability of the container.
> > (That would be our bug.)
> >
> > B. We were able to take the reboot lock despite the system
> > already having shut down our container and/or
> > the system shut down our container despite us holding
> > the reboot lock.
> > (We would need DSA to fix it.)
> >
> > C. Our reboot lock arrangements are somehow incompatible with DSA's
> > and the two systems just bypass each other. (To detect this we
> > might need to ask DSA to do a reboot with us watching, or
> > something.)
>
> I can't think of a way in which (A) could happen. It would have to be
> that the container is already doomed when we successfully take our lock.
> But that would imply that whatever has doomed the container has failed
> to attempt to take an exclusive lock before doing anything else?
I think you understand this logic better than me, but that sounds
fairly convincing. I think this means we need to talk to DSA.
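
For reference, the convention I would expect, assuming DSA's scheme
is flock(1)-based (which I haven't verified; the lock path and job
command here are made up):

    # Jobs hold a shared lock for the duration of the critical phase:
    flock --shared /run/lock/reboot-lock do-critical-work
    # Anything that dooms the container must first take an exclusive
    # lock, which waits for all shared holders to finish:
    flock --exclusive /run/lock/reboot-lock reboot
    # (A) would mean the container was doomed even though our shared
    # lock succeeded, i.e. the doomer skipped the exclusive lock.
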
Ian.
--
Ian Jackson <[email protected]> These opinions are my own.
Pronouns: they/he. If I emailed you from @fyvzl.net or @evade.org.uk,
that is a private address which bypasses my fierce spamfilter.