Hello again,
Adam poked me on IRC today asking a few questions about the state of
Jenkins, and why we're not generating test binaries for download.
The reason is simple: the tests are failing.
I've discussed this topic at length twice before, with little feedback:
https://lists.apache.org/thread.html/6e2bedbbf5c2b28af4237d0936dc21f056fdafa2ea0c0b457285b9dc@%3Cdev.couchdb.apache.org%3E
https://lists.apache.org/thread.html/16a310e3342d3f1ca73fb85f62829b76bbfa3759e418386b07e2827f@%3Cdev.couchdb.apache.org%3E
I have 5 specific proposals to get us back on track:
1. Get more targeted build workers for ppc64le and aarch64 platforms.
This is critical while we wait for #4 below. By having >1 hardware
platform to build on for each of these, we can hopefully pass those
architectures regularly, and start building real downloads and Docker
images for each of these. I know the user community really wants this.
If we get at least 2 of each worker, I'll change Jenkinsfile to use
those tagged workers rather than the qemu emulation we currently
have (which is failing).
2. Receive and provision the new CouchDB Jenkins build machine. IBM is
being very generous in getting this set up, and Paul Davis mentioned
the machine should be ready in the very near future.
Provisioning will have to include Docker + the qemu support. See
https://issues.apache.org/jira/browse/INFRA-18322 for details on that
and https://issues.apache.org/jira/browse/INFRA-17404 for the general
provisioning approach (we download Jenkins .jar from the ASF machine,
set it up to be `runit`-run on boot, run as many as we can on the
machine (I think the HW was selected to run 8 of these at once),
install the prerequisites, and request the 8x worker+password infos
from ASF Infra).
We have a choice: do we set this up just as 8x Jenkins workers, or do
we also start running our own Jenkins master (potentially on
couchdb-vm2)? The motivation to do the latter would be to add
credentials that could be used for automatic uploading of binaries to
places like bintray and Docker. (I am currently engaged with Infra in
trying to solve this for many projects, including Apache OpenWhisk.
One of the major limiting factors is that the shared ASF Jenkins
master's credentials can be accessed by all users on the server. This
is obviously a security nightmare.)
At the moment, we are "OK" using the ASF Jenkins master instance. But
as soon as we start depending on this service widely (see below) it'll
be very disruptive to take it down, even for a day or two. So it may
be best to make this decision sooner rather than later.
I'll be in touch with Infra next week on the global "automated
binary builds" issue, and will ask for guidance at that time.
3. Switch our PR gate on GitHub from Travis CI to Jenkins CI. This way,
people won't be blocked on PRs waiting forever anymore, since we'll
have a lot of compute resources at our disposal. That said,
**PEOPLE HAVE TO START FIXING THE INTERMITTENT TEST CASE FAILURES**
or we'll be right back to "Hey, it didn't pass...I'll just click
Retry" again. 😒 🤢 This will have to be a team effort.
4. Get rid of all timeouts in all test cases. A few proposals for this
were made in the context of ExUnit. Can we get some more progress
here?
https://github.com/apache/couchdb/issues/2030
https://github.com/apache/couchdb/pull/2039
5. Once 4 is done, we can consider moving aarch64/ppc64le/other binary
builds to qemu support, meaning we can test all platforms just on
simple x86_64 machines. It's not a required move, but if we lose
access to the other platforms, or they go down, it's a backup
strategy.
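
To make the provisioning in item 2 a bit more concrete, here is a rough
sketch (not the actual setup) of generating one `runit` service directory
per Jenkins worker. Everything specific in it is an assumption made up for
illustration: the master URL, the worker names, the file paths, the
`jenkins` user, and the `slave-agent.jnlp` endpoint plus agent flags, which
vary between Jenkins versions.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical placeholders; substitute whatever ASF Infra hands out.
JENKINS_URL = "https://jenkins.example.org"
AGENT_JAR = "/opt/jenkins/agent.jar"
WORK_ROOT = "/home/jenkins"
SECRET_DIR = "/etc/jenkins-secrets"


def render_run_script(worker: str) -> str:
    """Return the text of a runit `run` script for one Jenkins worker."""
    # Assumed endpoint name; newer Jenkins versions use a different one.
    jnlp_url = f"{JENKINS_URL}/computer/{worker}/slave-agent.jnlp"
    return "\n".join([
        "#!/bin/sh",
        "exec 2>&1",
        # chpst (part of runit) drops privileges to the assumed 'jenkins' user.
        f"exec chpst -u jenkins java -jar {AGENT_JAR} \\",
        f"  -jnlpUrl {jnlp_url} \\",
        f'  -secret "$(cat {SECRET_DIR}/{worker})" \\',
        f"  -workDir {WORK_ROOT}/{worker}",
        "",
    ])


def write_services(service_root: Path, workers: list[str]) -> list[Path]:
    """Create <service_root>/<worker>/run for each worker and return the paths."""
    written = []
    for worker in workers:
        service_dir = service_root / worker
        service_dir.mkdir(parents=True, exist_ok=True)
        run_path = service_dir / "run"
        run_path.write_text(render_run_script(worker))
        run_path.chmod(0o755)  # runit requires the run script to be executable
        written.append(run_path)
    return written
```

Calling `write_services(Path("/etc/sv"), [f"couchdb-worker-{i}" for i in range(1, 9)])`
would lay out eight service directories, matching the "8 at once" guess
above; each one still needs to be linked into the active runit service
directory and given its secret from Infra.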
What do people think?