Maybe we would want to have commit-triggered builds as well, and those could go in a common pool? We actually do that for PRs with GitHub now and it seems to work out okay, but it's not quite the same as having a build for each commit to main.
On Tue, Oct 15, 2024, 8:02 AM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

> Jenkins is just the 24/7 burn/smoker.

Ah, ok - this explains things perfectly! That was the piece I was missing: if we're intentionally keeping our project-specific Jenkins agents busy, we'd make very bad candidates for the general pool.

Thanks Uwe

On Tue, Oct 15, 2024 at 8:52 AM Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

On 15.10.2024 at 13:40, Jason Gerlowski wrote:

> Hi all,
>
> Appreciate the context Uwe! In poking around a bit I see there's a cleanup option in the job config called "Clean up this jobs workspaces from other slave nodes". Seems like that might help a bit, though I'd have to do a little more digging to be sure. It doesn't appear to be enabled on either the Lucene or Solr jobs, fwiw.

The problem with that option is that it cleans the workspace after every build, so the startup time is high whenever a job changes nodes. It is better to keep the workspace intact. At the moment we only use the option to clean the workspace's local changes (git reset).

> Does anyone have a pointer to the discussion around getting these project-specific build nodes? I searched around in JIRA and on lists.apache.org, but couldn't find a request for the nodes or a discussion about creating them. I would love to understand the rationale a bit better: in talking to INFRA folks last week, they suggested the main (only?) reason folks use project-specific VMs is to avoid waiting in the general pool... but our average wait time these days looks much, much longer than anything in the general pool.

The reasons for this were the following:

1. We wanted the I/O setup and the number of threads to come from the node itself rather than being hardcoded in the job config; hardcoding only works if all nodes have the same number of threads and so on. The problem is that the current defaults are tuned for developer boxes, so the build does not spin up many parallel threads while executing Gradle tasks. If we added a Gradle option that enables full parallelism when the CI env var is set (test forks/processes via tests.jvms=XXX, plus parallel Gradle tasks), this workaround would be obsolete: the jobs would start with Gradle's CI option, and the Gradle config would then use full parallelism when running tasks and tests.

At the moment we have a gradle.properties file per node (also on Policeman Jenkins), which is loaded automatically from the Jenkins user directory. To maintain those files, we need shell access.

Let's open a PR to make our Gradle build autodetect CI builds, and tune the number-of-threads logic in the autogenerator to use the system info (together with other factors) when the CI environment variable is set:

https://github.com/apache/lucene/blob/3d6af9cecce3b6ce5c017ef6f919a2d727e0ea77/build-tools/build-infra/src/main/java/org/apache/lucene/gradle/GradlePropertiesGenerator.java#L56-L62

(This would also spare us from copying files around on ASF Jenkins; that setup is rather complex.)
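For illustration only, a minimal sketch of that idea - this is not the current GradlePropertiesGenerator code; the CI variable name, the divisors, and the emitted property names are placeholders:

    // Illustrative only: derive parallelism settings from the machine when a CI
    // environment variable is set, otherwise keep conservative developer defaults.
    import java.util.Map;

    public class CiParallelismSketch {

      record Parallelism(int gradleWorkers, int testJvms) {}

      static Parallelism choose(Map<String, String> env, int availableProcessors) {
        boolean ci = env.containsKey("CI");
        if (ci) {
          // Dedicated CI node: use (nearly) all hardware threads.
          int workers = Math.max(1, availableProcessors);
          int jvms = Math.max(1, availableProcessors / 2);
          return new Parallelism(workers, jvms);
        }
        // Developer box: stay conservative so the machine remains usable.
        int workers = Math.max(1, availableProcessors / 2);
        return new Parallelism(workers, Math.min(4, workers));
      }

      public static void main(String[] args) {
        Parallelism p = choose(System.getenv(), Runtime.getRuntime().availableProcessors());
        // Property names here only mirror the ones discussed above; treat them as placeholders.
        System.out.printf("org.gradle.workers.max=%d%ntests.jvms=%d%n",
            p.gradleWorkers(), p.testJvms());
      }
    }

The point is just that the node describes itself at build time, instead of each node carrying a hand-maintained gradle.properties file.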
BUT: 2. The main reason we use a separate pool is the way our tests work. Because of the randomized tests we don't run the jobs only on commit, we run them 24/7, and the trigger scripts are written so that the work queue for the Lucene nodes is always filled. We don't start jobs on commit; we enqueue a new one every hour or more often (for Lucene). The longer wait time is therefore "wanted". It simply ensures that the queue is always *full*: Jenkins only adds a new build to the queue if the same job isn't already waiting there, so if you enqueue a new job every 5 minutes, there will always be one job running and another one waiting for execution.

If we flooded the general queue with jobs like that, other ASF projects would be very unhappy. So I'd suggest running the "burn and smoke the repo" tests on our own nodes, 24/7 with full job queues, and maybe only putting the non-test jobs (like publishing artifacts) into the common queue.

> From the "outside" looking in, it feels like we're taking on more maintenance/infra burden for worse results (at least as defined by 'uptime' and build-wait times).

See above - the results are not worse, you're looking at it from the wrong perspective! The job queue is full and the waiting time is long because we want the nodes occupied with jobs all the time. For normal ASF jobs people want low waiting times, because Jenkins should run right after commits and inform whoever broke something. We use GitHub for that; Jenkins is just the 24/7 burn/smoker.

Uwe

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> Best,
>
> Jason

On Tue, Oct 15, 2024 at 6:12 AM Uwe Schindler <u...@thetaphi.de> wrote:

Hi,

I have root access on both machines and was not aware of any problems. The workspace-name problem is a known one: if a node is down while a job is renamed, or if several nodes hold a workspace for the job, Jenkins can't delete it. In the multiple-node case, it only deletes the workspace on the node that ran the job most recently.

As a general rule I always follow: before renaming a job, go to the job and prune its workspace from the web interface. But this has the same limitation as described above: it only shows the workspace on the node that last executed the job.

Uwe

On 14.10.2024 at 22:01, Jason Gerlowski wrote:

Of course, happy to help - glad you got some "green" builds.

Both agents should be back online now.

The root of the problem appears to be that Jenkins jobs use a static workspace whose path is based on the name of the job. That would work great if job names never changed, I guess. But our job names *do* drift - both Lucene and Solr job names tend to include version strings (e.g. Solr-check-9.6, Lucene-check-9.12), which orphans a few workspaces a year. That doesn't sound like much, but each workspace contains a full Solr or Lucene checkout and build, so they add up pretty quickly. Anyway, that root problem remains and will need to be addressed if our projects want to keep the specially tagged agents. But things are healthy for now!

Best,

Jason
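For illustration, a rough, hypothetical sketch of the kind of cleanup this implies - not an existing script, and the workspace root and the job-name list file are assumptions - which lists workspace directories on an agent that no longer match any configured job name:

    // Hypothetical helper: flag workspace directories that no longer match any
    // configured job name, so stale checkouts from renamed jobs can be pruned.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class OrphanWorkspaceFinder {

      public static void main(String[] args) throws IOException {
        if (args.length < 2) {
          System.err.println("usage: OrphanWorkspaceFinder <workspaceRoot> <jobNamesFile>");
          return;
        }
        // Assumed layout: one directory per job under the agent's workspace root,
        // e.g. <workspaceRoot>/Lucene-check-9.12.
        Path workspaceRoot = Paths.get(args[0]);
        // Current job names, one per line, exported from the Jenkins controller by other means.
        Set<String> activeJobs = Files.readAllLines(Paths.get(args[1])).stream()
            .map(String::trim)
            .collect(Collectors.toSet());

        try (Stream<Path> dirs = Files.list(workspaceRoot)) {
          dirs.filter(Files::isDirectory)
              .filter(dir -> !activeJobs.contains(dir.getFileName().toString()))
              .forEach(dir -> System.out.println("possibly orphaned: " + dir));
        }
      }
    }

Real agents may also have workspace directories with extra suffixes (for example for concurrent builds), so the output is a starting point for manual review rather than something to delete automatically.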
On Tue, Oct 8, 2024 at 3:10 AM Luca Cavanna <java...@apache.org> wrote:

Thanks a lot Jason, this helps a lot. I see that the newly added jobs for 10x and 10.0 have been built, and it all looks pretty green now.

Thanks
Luca

On Mon, Oct 7, 2024 at 11:27 PM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

Hi Luca,

I suspect I'm chiming in a little late to help with your release-related question, but... I stopped into the "#askinfra Office Hours" this afternoon at ApacheCon and asked for some help with this. Both workers seemed to have disk-space issues, apparently due to orphaned workspaces. I've gotten one agent/worker back online (lucene-solr-2, I believe). The other one I'm hoping to get back online shortly, after a bit more cleanup.

(Getting the right permissions to clean things up was a bit of a process; I'm hoping to document this and will share it here when it's ready.)

There are still nightly jobs that run on the ASF Jenkins (for both Lucene and Solr); on the Solr side at least, these are quite useful.

Best,

Jason

On Wed, Oct 2, 2024 at 2:40 PM Luca Cavanna <java...@apache.org> wrote:

Hi all,

I created new CI jobs at https://ci-builds.apache.org/job/Lucene/ yesterday to cover branch_10x and branch_10_0. Not a single build has started for them so far.

Poking around, I noticed a message in the build history saying "Pending - all nodes of label Lucene are offline", which looked suspicious. Are we still using this Jenkins? I used it successfully for the releases I have done in the past, but that was already some months ago. Creating these jobs is still part of the release wizard process anyway, so it felt right to do that step. I am not sure how to proceed from here - does anyone know? I also noticed a low-disk-space warning on one of the two agents.

Thanks
Luca