control: retitle -1 /var/run/reboot-lock ineffective on tag2upload-builder-01

Hello DSA,

We think that the /var/run/reboot-lock mechanism is not doing what it
should on tag2upload-builder-01.

In bug #1125239, we believe tag2upload job 2390 failed because
tag2upload-builder-01 rebooted in the middle of the critical section,
even though we held a lock on /var/run/reboot-lock, which should have
prevented the reboot.

We think one of the following two sequences of events occurred:

 1. tag2upload-oracle-01 opens an SSH connection to a Podman
    container, takes no locks, and waits for a job.
A2. A reboot of tag2upload-builder-01 begins, but no exclusive lock on
    /var/run/reboot-lock is taken.
A3. A job comes in.  tag2upload-oracle-01 SSHes to tag2upload-builder-01
    a second time and runs flock(1) on /var/run/reboot-lock; a
    nonexclusive lock is successfully acquired.
 4. tag2upload-oracle-01 tries to start the build inside the container
    but finds it is unusable, getting errors from Podman.

-- OR --

 1. tag2upload-oracle-01 opens an SSH connection to a Podman
    container, takes no locks, and waits for a job.
B2. A job comes in.  tag2upload-oracle-01 SSHes to tag2upload-builder-01
    a second time and runs flock(1) on /var/run/reboot-lock; a
    nonexclusive lock is successfully acquired.
B3. A reboot of tag2upload-builder-01 begins, ignoring the lock on
    /var/run/reboot-lock.
 4. tag2upload-oracle-01 tries to start the build inside the container
    but finds it is unusable, getting errors from Podman.

We can't currently see any way in which our code failed to take the
locks; our test suite checks that our program bails out if the locks
can't be acquired.  So our tentative conclusion is that there is a bug
in your handling of /var/run/reboot-lock on tag2upload-builder-01.
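
For reference, here is a minimal sketch of the locking semantics we are
relying on.  It is not our actual code, and it assumes that the reboot
procedure is meant to take an exclusive flock(2) lock on the same file.
It uses a temporary file in place of /var/run/reboot-lock; the names
try_lock, job_fd and reboot_fd are made up for illustration.

```python
import fcntl
import os
import tempfile


def try_lock(fd, mode):
    """Try to take a flock(2) lock without blocking; return True on success."""
    try:
        fcntl.flock(fd, mode | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def demo():
    # A temporary file stands in for /var/run/reboot-lock (assumption).
    with tempfile.NamedTemporaryFile() as tmp:
        # Two separate open file descriptions, like two unrelated processes.
        job_fd = os.open(tmp.name, os.O_RDONLY)
        reboot_fd = os.open(tmp.name, os.O_RDONLY)
        try:
            # The job takes a shared (nonexclusive) lock for its critical section.
            job_has_lock = try_lock(job_fd, fcntl.LOCK_SH)
            # A reboot that honours the lock must fail to get an exclusive lock now.
            reboot_during_job = try_lock(reboot_fd, fcntl.LOCK_EX)
            # Once the job releases its lock, the reboot may proceed.
            fcntl.flock(job_fd, fcntl.LOCK_UN)
            reboot_after_job = try_lock(reboot_fd, fcntl.LOCK_EX)
            return job_has_lock, reboot_during_job, reboot_after_job
        finally:
            os.close(job_fd)
            os.close(reboot_fd)


if __name__ == "__main__":
    print(demo())
```

If the reboot path behaved like reboot_fd above, neither sequence A nor
sequence B could occur: in A the job's shared lock would be refused
while the reboot holds the exclusive lock, and in B the reboot would
wait for the job to finish.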

Thanks!

-- 
Sean Whitton
