Maybe we would want to have commit-triggered builds as well, and those could go in a common pool? We actually do that for PRs with GitHub now and it seems to work out okay, but it's not quite the same as having a build for each commit to main.
On Tue, Oct 15, 2024, 8:02 AM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

> Jenkins is just the 24/7 burn/smoker.

Ah, ok - this explains things perfectly! That was the piece I was missing: if we're intentionally keeping our project-specific Jenkins agents busy, we'd make very bad candidates for the general pool.

Thanks Uwe

On Tue, Oct 15, 2024 at 8:52 AM Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

On 15.10.2024 at 13:40, Jason Gerlowski wrote:

> Hi all,
>
> Appreciate the context Uwe! In poking around a bit I see there's a cleanup option in the job config called "Clean up this jobs workspaces from other slave nodes". Seems like that might help a bit, though I'd have to do a little more digging to be sure. It doesn't appear to be enabled on either the Lucene or Solr jobs, fwiw.

The problem with that option is that it cleans the workspace after every build, so the startup time is high whenever a job changes nodes. It is better to keep the workspace intact. At the moment we only use the option to clean the workspace's local changes (git reset).

> Does anyone have a pointer to the discussion around getting these project-specific build nodes? I searched around in JIRA and on lists.apache.org, but couldn't find a request for the nodes or a discussion about creating them. I would love to understand the rationale a bit better: in talking to INFRA folks last week, they suggested the main (only?) reason folks use project-specific VMs is to avoid waiting in the general pool... but our average wait time these days looks much, much longer than anything in the general pool.

The reasons for this were the following:

1. We wanted the I/O setup and the number of threads to come from the node itself rather than being hardcoded in the job config; hardcoding only works if all nodes have the same number of threads and so on. The problem is that the current defaults are tuned for developer boxes, so the build does not spin up many parallel threads while executing Gradle tasks. If we added a Gradle option that enables full parallelism when the CI env var is set (test forks/processes via tests.jvms=XXX, plus parallel Gradle tasks), this workaround would be obsolete: the jobs would start with Gradle's CI option, and the Gradle config would then use full parallelism when running tasks and tests.

At the moment we have a gradle.properties file per node (also on Policeman Jenkins), which is loaded automatically from the Jenkins user directory. To maintain those files, we need shell access.

Let's open a PR to make our Gradle build autodetect CI builds, and tune the number-of-threads logic in the autogenerator to use the system info (together with other factors) when the CI environment variable is set:

https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62

(This would also spare us from copying files around on ASF Jenkins; that setup is rather complex.)
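For illustration only, a minimal sketch of that idea - this is not the current GradlePropertiesGenerator code; the CI variable name, the divisors, and the emitted property names are placeholders:

    // Illustrative only: derive parallelism settings from the machine when a CI
    // environment variable is set, otherwise keep conservative developer defaults.
    import java.util.Map;

    public class CiParallelismSketch {

      record Parallelism(int gradleWorkers, int testJvms) {}

      static Parallelism choose(Map<String, String> env, int availableProcessors) {
        boolean ci = env.containsKey("CI");
        if (ci) {
          // Dedicated CI node: use (nearly) all hardware threads.
          int workers = Math.max(1, availableProcessors);
          int jvms = Math.max(1, availableProcessors / 2);
          return new Parallelism(workers, jvms);
        }
        // Developer box: stay conservative so the machine remains usable.
        int workers = Math.max(1, availableProcessors / 2);
        return new Parallelism(workers, Math.min(4, workers));
      }

      public static void main(String[] args) {
        Parallelism p = choose(System.getenv(), Runtime.getRuntime().availableProcessors());
        // Property names here only mirror the ones discussed above; treat them as placeholders.
        System.out.printf("org.gradle.workers.max=%d%ntests.jvms=%d%n",
            p.gradleWorkers(), p.testJvms());
      }
    }

The point is just that the node describes itself at build time, instead of each node carrying a hand-maintained gradle.properties file.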
BUT: 2. The main reason we use a separate pool is the way our tests work. Because of the randomized tests we don't run the jobs only on commit, we run them 24/7, and the trigger scripts are written so that the work queue for the Lucene nodes is always filled. We don't start jobs on commit; we enqueue a new one every hour or more often (for Lucene). The longer wait time is therefore "wanted". It simply ensures that the queue is always *full*: Jenkins only adds a new build to the queue if the same job isn't already waiting there, so if you enqueue a new job every 5 minutes, there will always be one job running and another one waiting for execution.

If we flooded the general queue with jobs like that, other ASF projects would be very unhappy. So I'd suggest running the "burn and smoke the repo" tests on our own nodes, 24/7 with full job queues, and maybe only putting the non-test jobs (like publishing artifacts) into the common queue.

> From the "outside" looking in, it feels like we're taking on more maintenance/infra burden for worse results (at least as defined by 'uptime' and build-wait times).

See above - the results are not worse, you're looking at it from the wrong perspective! The job queue is full and the waiting time is long because we want the nodes occupied with jobs all the time. For normal ASF jobs people want low waiting times, because Jenkins should run right after commits and inform whoever broke something. We use GitHub for that; Jenkins is just the 24/7 burn/smoker.

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> Best,
>
> Jason

On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

I have root access on both machines and was not aware of any problems. The workspace-name problem is a known one: if a node is down while a job is renamed, or if several nodes hold a workspace for the job, Jenkins can't delete it. In the multiple-node case, it only deletes the workspace on the node that ran the job most recently.

As a general rule I always follow: before renaming a job, go to the job and prune its workspace from the web interface. But this has the same limitation as described above: it only shows the workspace on the node that last executed the job.

Uwe

On 14.10.2024 at 22:01, Jason Gerlowski wrote:

Of course, happy to help - glad you got some "green" builds.

Both agents should be back online now.

The root of the problem appears to be that Jenkins jobs use a static workspace whose path is based on the name of the job. That would work great if job names never changed, I guess. But our job names *do* drift - both Lucene and Solr job names tend to include version strings (e.g. Solr-check-9.6, Lucene-check-9.12), which orphans a few workspaces a year. That doesn't sound like much, but each workspace contains a full Solr or Lucene checkout and build, so they add up pretty quickly. Anyway, that root problem remains and will need to be addressed if our projects want to keep the specially tagged agents. But things are healthy for now!

Best,

Jason
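For illustration, a rough, hypothetical sketch of the kind of cleanup this implies - not an existing script, and the workspace root and the job-name list file are assumptions - which lists workspace directories on an agent that no longer match any configured job name:

    // Hypothetical helper: flag workspace directories that no longer match any
    // configured job name, so stale checkouts from renamed jobs can be pruned.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class OrphanWorkspaceFinder {

      public static void main(String[] args) throws IOException {
        if (args.length < 2) {
          System.err.println("usage: OrphanWorkspaceFinder <workspaceRoot> <jobNamesFile>");
          return;
        }
        // Assumed layout: one directory per job under the agent's workspace root,
        // e.g. <workspaceRoot>/Lucene-check-9.12.
        Path workspaceRoot = Paths.get(args[0]);
        // Current job names, one per line, exported from the Jenkins controller by other means.
        Set<String> activeJobs = Files.readAllLines(Paths.get(args[1])).stream()
            .map(String::trim)
            .collect(Collectors.toSet());

        try (Stream<Path> dirs = Files.list(workspaceRoot)) {
          dirs.filter(Files::isDirectory)
              .filter(dir -> !activeJobs.contains(dir.getFileName().toString()))
              .forEach(dir -> System.out.println("possibly orphaned: " + dir));
        }
      }
    }

Real agents may also have workspace directories with extra suffixes (for example for concurrent builds), so the output is a starting point for manual review rather than something to delete automatically.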
On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:

Thanks a lot Jason, this helps a lot. I see that the newly added jobs for 10x and 10.0 have been built, and it all looks pretty green now.

Thanks
Luca

On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

Hi Luca,

I suspect I'm chiming in a little late to help with your release-related question, but... I stopped into the "#askinfra Office Hours" this afternoon at ApacheCon and asked for some help with this. Both workers seemed to have disk-space issues, apparently due to orphaned workspaces. I've gotten one agent/worker back online (lucene-solr-2, I believe). The other one I'm hoping to get back online shortly, after a bit more cleanup.

(Getting the right permissions to clean things up was a bit of a process; I'm hoping to document this and will share it here when it's ready.)

There are still nightly jobs that run on the ASF Jenkins (for both Lucene and Solr); on the Solr side at least, these are quite useful.

Best,

Jason

On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:

Hi all,

I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday to cover branch_10x and branch_10_0. Not a single build has started for them so far.

Poking around, I noticed a message in the build history saying "Pending - all nodes of label Lucene are offline", which looked suspicious. Are we still using this Jenkins? I used it successfully for the releases I have done in the past, but that was already some months ago. Creating these jobs is still part of the release wizard process anyway, so it felt right to do that step. I am not sure how to proceed from here - does anyone know? I also noticed a low-disk-space warning on one of the two agents.

Thanks
Luca