Hi,
On 15.10.2024 at 13:40, Jason Gerlowski wrote:
Hi all,
Appreciate the context, Uwe! In poking around a bit I see there's a
cleanup option in the job config called: "Clean up this jobs
workspaces from other slave nodes". Seems like that might help a bit,
though I'd have to do a little more digging to be sure. Doesn't
appear to be enabled on either the Lucene or Solr jobs, fwiw.
The problem with that build option is that it cleans the workspace after
every build, so startup time is high whenever a job changes nodes. It is
better to keep the workspace intact. At the moment we only use the option
to revert local changes in the workspace (git reset).
Does anyone have a pointer to the discussion around getting these
project-specific build nodes? I searched around in JIRA and on
lists.apache.org, but couldn't find a request for the nodes or a
discussion about creating them. Would love to understand the
rationale a bit better: in talking to INFRA folks last week, they
suggested the main (only?) reason folks use project-specific VMs is to
avoid waiting in the general pool...but our average wait time these
days at least looks much much longer than anything in the general
pool.
The reason for this was the following:
*1.* We wanted the build to adapt to each node's I/O system and number of
threads instead of hardcoding them in the job config; hardcoding would only
work if all nodes had the same number of threads and so on. The problem is
that the current defaults are tuned for developer boxes only, so Gradle
builds do not spin up many parallel threads. If we added a Gradle option
that enables full parallelism (test forks/processes via tests.jvms=XXX and
parallel Gradle tasks) whenever the CI environment variable is set, this
per-node setup would become obsolete: the jobs would simply start with the
CI option, and the Gradle config would then use full parallelism for tasks
and tests.
At the moment we have a gradle.properties file per node (also on Policeman
Jenkins), which is loaded automatically from the Jenkins user's home
directory. Maintaining those files requires shell access.
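(Just for illustration, such a per-node file only contains a handful of
overrides; tests.jvms is the property mentioned above and
org.gradle.workers.max is a standard Gradle property, but the values here
are made up, not what any of our nodes actually uses:)

    # example per-node overrides - values are illustrative only
    org.gradle.workers.max=16
    tests.jvms=8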
Let's open a PR that makes our Gradle build autodetect CI builds, and tune
the thread-count logic in the properties autogenerator to use the system
info (together with other factors) when the CI environment variable is set:
https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62
(This would also spare us from copying files around on ASF Jenkins; that
setup is rather complex.)
*BUT: 2.* The main reason why we use a separate pool is the way our tests
work: because of randomized tests, we don't run the jobs only on commit, we
run them 24/7. The job triggers are therefore written to keep the work
queue for the Lucene nodes always filled. We don't start jobs on commit; we
queue a new one every hour or more often (for Lucene). The longer wait time
is therefore intentional: it just ensures that the queue is always *full*.
Jenkins only enqueues a new run of a job if no other run of it is already
waiting, so if you enqueue a new job every 5 minutes, there will always be
one job running and another one waiting for execution.
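(For reference, "queue a new one every hour" is just a periodic trigger in
each job's configuration; in Jenkins cron syntax an hourly schedule looks
roughly like this, with the exact cadence varying per job:)

    # "Build periodically" schedule - one build per hour, H spreads the minute
    H * * * *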
If we flooded the general queue with jobs like these, other ASF projects
would be very unhappy. So I'd suggest running the "burn and smoke the repo"
tests on our own nodes 24/7 with full job queues, and maybe only putting
the jobs that are not testing jobs (like publishing artifacts) on the
common queue.
From the "outside" looking in, it feels like we're taking on more
maintenance/infra burden for worse results (at least as defined by
'uptime' and build-wait times).
See above: the results are not worse, you're looking at it from the wrong
perspective! The job queue is full and the waiting time is long because we
want the nodes occupied with jobs all the time. For normal ASF jobs people
want low waiting times, because Jenkins should run right after commits so
that people who break something get informed. We use GitHub for that;
Jenkins is just the 24/7 burn/smoke tester.
Uwe
Best,
Jason
On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:
Hi,
I have root access on both machines. I was not aware of any problems. The
workspace name problem is a known one: if a node is down while the job is
renamed, or if multiple nodes hold a copy of the workspace, Jenkins can't
delete it. In the case of multiple nodes, it only deletes the workspace on
the node that ran the job most recently.
As a general rule I always follow: before renaming a job, go to the job and
prune the workspace from the web interface. But this has the same problem
as described before: it only shows the workspace on the last node that
executed the job.
Uwe
On 14.10.2024 at 22:01, Jason Gerlowski wrote:
Of course, happy to help - glad you got some 'green' builds.
Both agents should be back online now.
The root of the problem appears to be that Jenkins jobs use a static
workspace whose path is based on the name of the job. This would work
great if job names never changed, I guess. But our job names *do*
drift - both Lucene and Solr job names tend to include version strings
(e.g. Solr-check-9.6, Lucene-check-9.12), which orphans a few
workspaces a year. That doesn't sound like much, but
each workspace contains a full Solr or Lucene checkout+build, so they
add up pretty quickly. Anyway, that root problem remains and will
need to be addressed if our projects want to continue using the specially
tagged agents. But things are healthy for now!
Best,
Jason
On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:
Thanks a lot Jason,
this helps a lot. I see that the newly added jobs for 10x and 10.0 have been
built and it all looks pretty green now.
Thanks
Luca
On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:
Hi Luca,
I suspect I'm chiming in here a little late to help with your
release-related question, but...
I stopped into the "#askinfra Office Hours" this afternoon at
ApacheCon, and asked for some help on this. Both workers seemed to
have disk-space issues, seemingly due to orphaned workspaces. I've
gotten one agent/worker back online (lucene-solr-2 I believe). The
other one I'm hoping to get back online shortly, after a bit more
cleanup.
(Getting the right permissions to clean things up was a bit of a
process; I'm hoping to document this and will share here when that's
ready.)
There are still nightly jobs that run on the ASF Jenkins (for both
Lucene and Solr); on the Solr side at least these are quite useful.
Best,
Jason
On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:
Hi all,
I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday
to cover branch_10x and branch_10_0. Not a single build for them has started
so far. Poking around, I noticed in the build history a message "Pending -
all nodes of label Lucene are offline", which looked suspicious. Are we
still using this Jenkins? I successfully used it for a release in the past,
but that was already some months ago. Creating the jobs is still part of the
release wizard process anyway, so it felt right to do this step. I am not
sure how to proceed from here; does anyone know? I also noticed a
low-disk-space warning on one of the two agents.
Thanks
Luca
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de