Hi,
On 15.10.2024 at 13:40, Jason Gerlowski wrote:
Hi all,
Appreciate the context, Uwe! In poking around a bit I see there's a
cleanup option in the job config called: "Clean up this jobs
workspaces from other slave nodes". Seems like that might help a bit,
though I'd have to do a little more digging to be sure. Doesn't
appear to be enabled on either the Lucene or Solr jobs, fwiw.
The problem with that build option is that it cleans the workspace after
every build, so startup time is high whenever a job changes nodes. It is
better to keep the workspace intact. At the moment we only use the option
to revert local changes in the workspace (git reset).
Does anyone have a pointer to the discussion around getting these
project-specific build nodes? I searched around in JIRA and on
lists.apache.org, but couldn't find a request for the nodes or a
discussion about creating them. Would love to understand the
rationale a bit better: in talking to INFRA folks last week, they
suggested the main (only?) reason folks use project-specific VMs is to
avoid waiting in the general pool...but our average wait time these
days at least looks much much longer than anything in the general
pool.
The reason for this was the following:
*1.* We wanted the build to adapt to each node's I/O system and number of
threads instead of hardcoding them in the job config; hardcoding would only
work if all nodes had the same number of threads and so on. The problem is
that the current defaults are tuned for developer boxes only, so Gradle
builds do not spin up many parallel threads. If we added a Gradle option
that enables full parallelism (test forks/processes via tests.jvms=XXX and
parallel Gradle tasks) whenever the CI environment variable is set, this
per-node setup would become obsolete: the jobs would simply start with the
CI option, and the Gradle config would then use full parallelism for tasks
and tests.
At the moment we have a gradle.properties file per node (also on Policeman
Jenkins), which is loaded automatically from the Jenkins user's home
directory. Maintaining those files requires shell access.
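(Just for illustration, such a per-node file only contains a handful of
overrides; tests.jvms is the property mentioned above and
org.gradle.workers.max is a standard Gradle property, but the values here
are made up, not what any of our nodes actually uses:)

    # example per-node overrides - values are illustrative only
    org.gradle.workers.max=16
    tests.jvms=8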
Let's open a PR that makes our Gradle build autodetect CI builds, and tune
the thread-count logic in the properties autogenerator to use the system
info (together with other factors) when the CI environment variable is set:
https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62
(This would also spare us from copying files around on ASF Jenkins; that
setup is rather complex.)
*BUT: 2.* The main reason why we use a separate pool is the way our tests
work: because of randomized tests, we don't run the jobs only on commit, we
run them 24/7. The job triggers are therefore written to keep the work
queue for the Lucene nodes always filled. We don't start jobs on commit; we
queue a new one every hour or more often (for Lucene). The longer wait time
is therefore intentional: it just ensures that the queue is always *full*.
Jenkins only enqueues a new run of a job if no other run of it is already
waiting, so if you enqueue a new job every 5 minutes, there will always be
one job running and another one waiting for execution.
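(For reference, "queue a new one every hour" is just a periodic trigger in
each job's configuration; in Jenkins cron syntax an hourly schedule looks
roughly like this, with the exact cadence varying per job:)

    # "Build periodically" schedule - one build per hour, H spreads the minute
    H * * * *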
If we flooded the general queue with jobs like these, other ASF projects
would be very unhappy. So I'd suggest running the "burn and smoke the repo"
tests on our own nodes 24/7 with full job queues, and maybe only putting
the jobs that are not testing jobs (like publishing artifacts) on the
common queue.
From the "outside" looking in, it feels like we're taking on more
maintenance/infra burden for worse results (at least as defined by
'uptime' and build-wait times).
See above: the results are not worse, you're looking at it from the wrong
perspective! The job queue is full and the waiting time is long because we
want the nodes occupied with jobs all the time. For normal ASF jobs people
want low waiting times, because Jenkins should run right after commits so
that people who break something get informed. We use GitHub for that;
Jenkins is just the 24/7 burn/smoke tester.
Uwe
Best,
Jason
On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:
Hi,
I have root access on both machines. I was not aware of any problems. The
workspace name problem is a known one: if a node is down while the job is
renamed, or if multiple nodes hold a copy of the workspace, Jenkins can't
delete it. In the case of multiple nodes, it only deletes the workspace on
the node that ran the job most recently.
As a general rule I always follow: before renaming a job, go to the job and
prune the workspace from the web interface. But this has the same problem
as described before: it only shows the workspace on the last node that
executed the job.
Uwe
On 14.10.2024 at 22:01, Jason Gerlowski wrote:
Of course, happy to help - glad you got some 'green' builds.
Both agents should be back online now.
The root of the problem appears to be that Jenkins jobs use a static
workspace whose path is based on the name of the job. This would work
great if job names never changed, I guess. But our job names *do*
drift - both Lucene and Solr job names tend to include version strings
(e.g. Solr-check-9.6, Lucene-check-9.12), which orphans a few
workspaces a year. That doesn't sound like much, but
each workspace contains a full Solr or Lucene checkout+build, so they
add up pretty quickly. Anyway, that root problem remains and will
need to be addressed if our projects want to continue using the specially
tagged agents. But things are healthy for now!
Best,
Jason
On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:
Thanks a lot Jason,
this helps a lot. I see that the newly added jobs for 10x and 10.0 have been
built and it all looks pretty green now.
Thanks
Luca
On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
Hi Luca,
I suspect I'm chiming in here a little late to help with your
release-related question, but...
I stopped into the "#askinfra Office Hours" this afternoon at
ApacheCon, and asked for some help on this. Both workers seemed to
have disk-space issues, seemingly due to orphaned workspaces. I've
gotten one agent/worker back online (lucene-solr-2 I believe). The
other one I'm hoping to get back online shortly, after a bit more
cleanup.
(Getting the right permissions to clean things up was a bit of a
process; I'm hoping to document this and will share here when that's
ready.)
There are still nightly jobs that run on the ASF Jenkins (for both
Lucene and Solr); on the Solr side at least these are quite useful.
Best,
Jason
On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:
Hi all,
I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday
to cover branch_10x and branch_10_0. Not a single build for them has started
so far. Poking around, I noticed in the build history a message "Pending -
all nodes of label Lucene are offline", which looked suspicious. Are we
still using this Jenkins? I successfully used it for a release in the past,
but that was already some months ago. Creating the jobs is still part of the
release wizard process anyway, so it felt right to do this step. I am not
sure how to proceed from here; does anyone know? I also noticed a
low-disk-space warning on one of the two agents.
Thanks
Luca
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de