Re: Hadoop Windows Build
> On May 3, 2024, at 9:04 AM, Gavin McDonald wrote: > > Build times are in the order of days, not hours, how is the caching helping > here? It won’t help for full builds but for PRs where it only does parts of the tree it can be dramatic. (Remember: this is running Yetus which will only rebuild required modules.)
Re: Hadoop Windows Build
> On Apr 26, 2024, at 9:42 AM, Cesar Hernandez wrote: > > My two cents is to use cleanWs() instead of deleteDir() as > documented in: https://plugins.jenkins.io/ws-cleanup/ If this was a generic, run-of-the-mill build, that could be an option. We definitely don’t want to do that for Hadoop builds, though. There is a bunch of caching happening to speed things up; deleting those caches would be _very_ detrimental to build times.
Re: ASF Maven parent pom and use properties to define the versions of plugins
> On Jun 7, 2023, at 9:36 PM, Christopher wrote: > > I think your concern about the risks of updating plugin versions is > valid, but I don't think it has anything to do with how those plugin > versions are expressed in the parent POM. If anything, using > properties to express these versions would make it easier for you to > update the parent POM, but hold back specific plugins when those > versions cause problems for you. You could also continue doing what > you're doing now and not update the parent POM. That's perfectly > valid. I just wonder, if you're going to do that, why care about how > versions are expressed as properties in newer versions of the parent > POM, enough to offer a -1 at the idea, if you're not even interested > in using those newer versions of the parent POM? I was under the impression that a bunch of _new_ entries were suddenly going to happen with this change. I’m a big fan of less is more in my build tools.
Re: ASF Maven parent pom and use properties to define the versions of plugins
> On Jun 7, 2023, at 11:46 AM, Karl Heinz Marbaise wrote: > > Hi, > On 07.06.23 19:23, Allen Wittenauer wrote: >> >> >>> On Jun 5, 2023, at 3:28 PM, Slawomir Jaranowski >>> wrote: >>> >>> Hi, >>> >>> I want to introduce properties to define versions of plugins. >>> I have prepared PR [1] and we have a discussion about properties schema for >>> such purposes. >>> >>> Because AFS Maven parent is used by other ASF projects, and such changes >>> can be difficult to change in the next release, I want to know other >>> opinions. >> >> -1 >> >> Some projects are stuck on old versions of the pom because newer ones >> introduce plugins with bugs. e.g., MASSEMBLY-942 stopped some projects on >> v21 for a very long time. > > The issue is related to a non Apache API (build-api related to Eclipse) > abandoned (12 years old+) ... > And why does a Eclipse related bugs stops you to use that in builds? > > Which plugins are we talking exactly? Which kind of bugs have occurred? Whoops, I meant MASSEMBLY-941, which left a trail of dead in its wake, all linked to in the ticket. I know I hit a bug with the latest Maven pom where something (I’m guessing assembly again) tries to resolve relative symlinks and makes them absolute, which in turn blows up the build. I don’t have time to track it down, so I’ll likely just stick with an ancient version of the Apache pom. I just don’t have time to debug this stuff. Even though we only release this project maybe twice a year, every year it is “can we update the Apache pom? nope,” so I’ll likely just stop even attempting to do it.
Re: ASF Maven parent pom and use properties to define the versions of plugins
> On Jun 5, 2023, at 3:28 PM, Slawomir Jaranowski > wrote: > > Hi, > > I want to introduce properties to define versions of plugins. > I have prepared PR [1] and we have a discussion about properties schema for > such purposes. > > Because AFS Maven parent is used by other ASF projects, and such changes > can be difficult to change in the next release, I want to know other > opinions. -1 Some projects are stuck on old versions of the pom because newer ones introduce plugins with bugs. e.g., MASSEMBLY-942 stopped some projects on v21 for a very long time. So no, the parent pom needs to define less, not more. [I’m almost to the point of just forking the thing and removing bits because it is so wildly unreliable.]
Re: Broken Builds because of gradle-enterprise
> On Dec 9, 2022, at 7:43 AM, Greg Stein wrote: > > We make changes to the Jenkins environment all the time, to keep it > operational, on the latest software, to provide enough CPU and disk space > for builds, add requested plugins, and more. We do not advise projects > before we make changes because we expect no problems to arise. This fell > into that same kind of "routine change", or so we thought. From https://plugins.jenkins.io/gradle/#plugin-content-gradle-enterprise-integration: Note - The configuration applies to all builds on all connected agents matching the specified label criteria, or all in case no label criteria are defined. So projects that want to use the feature need custom labels, and their builds need to move to those labels.
Re: Multi-arch container images on DockerHub
> On Dec 6, 2022, at 4:43 AM, Robert Munteanu wrote: > > I see two alternatives so far: > > 1. Moving to GitHub actions Apache Yetus did the move from docker hub builds to Github Actions because ... > 2. Use hooks to install qemu and 'fake' a multi-arch build on Docker > Hub ... when I tried to do this a bit over a year ago, the kernel on the docker hub machines didn't support qemu. https://github.com/docker/roadmap/issues/109 seems to still be open so that functionality is likely still missing. That said, the project kept the hook in place in case it is ever supported. So... > How are other projects handling this? Or does anyone have any ideas > that they can share? ... the build is pretty much contained in two files: https://github.com/apache/yetus/blob/main/.github/workflows/ghcr.yml https://github.com/apache/yetus/blob/main/hooks/build The build file does a lot of extra work that may/may not be desired (such as building cascading container images), but it should at least work on Linux and macOS. Being able to run it locally after a bit of multi-arch setup is a _huge_ debugging win vs. going all-in on a native GH Action method. The hooks/build file is also still run on docker hub so as to not break users who are still pulling from there. At some point, we'll have a discussion in the project about getting rid of it but for now, everything is relatively consistent between the two container repos. Just that docker hub repo has one built for only x86 and the GHCR has both arm and x86 as a multi-arch image. Some unsolicited advice: keep in mind the bus factor. A lot of projects have wildly complex build systems that are maintained by one maybe two people. While those build processes may be faster or better or more functional or whatever... at some point other people will need to understand it.
Re: Meeting this Thursday
> On Oct 17, 2022, at 6:11 AM, Daan Hoogland wrote: > > - can a jenkins job be restarted for the apache jenkins server for the > exact same (merged) code? It seems the analysis job doensn´t start correctly Broadly, yes. The caveat here is that Jenkins is branch aware and tends to use the last commit on a branch as the place where restarts happen. This means if you need a _specific_ commit on a branch that has had additional commits, then you’ll need to mark that location with something that Jenkins sees (a branch, a PR, or a tag if configured). There are some tricks to work around that “limitation.” But in practice it generally isn’t one, since the vast majority of the time people seem to want to always test the latest commit of their branch/PR/tag. > - can the apache jenkins job be started from a PR if it wasn't added for > some reason. The same trick mentioned above is basically to change your job to take the repo/commit info and forcibly check that out. But if the PR isn’t showing up, something else went wrong, especially if a Scan Repository isn’t pulling in the PR while others are there. (I’m assuming GitHub Multibranch is being used… if something else is being used, Step 1: switch to GHMB!)
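The "take the repo/commit info and forcibly check that out" trick can be sketched roughly like this. The helper name and layout are illustrative (this is not an actual Jenkins or Yetus facility); it just emits the commands a parameterized job could run to pin an exact commit, assuming the git server allows fetching by SHA:

```shell
# Hypothetical sketch: given a repo URL and a commit SHA as job parameters,
# print the commands that would check out that exact revision, regardless of
# what has since landed on the branch.
pin_commit_cmds() {
  local repo="$1" sha="$2"
  printf '%s\n' \
    "git init workdir" \
    "git -C workdir fetch ${repo} ${sha}" \
    "git -C workdir checkout --detach FETCH_HEAD"
}
```

A job would then run those commands in its checkout step instead of relying on the branch source's notion of "latest."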
Re: Building with Travis - anyone?
> On Sep 1, 2022, at 2:52 AM, P. Ottlinger wrote: > over the years Travis seems to have degraded: builds fail regularly due to > technical issues or the inability to download artifacts from Maven central > that are mirrored onto Travis resources. > > Does any other project have stable Travis builds? > ... > > GithubAction builds are green, so is ASF Jenkins and local builds > > Thanks for any opinions, links or hints > Apache Yetus still has Travis builds and Github Actions running from the project's repo. I also run the other Apache Yetus-supported CIs from my personal account regularly so that the project doesn't expose the rest of the ASF projects to them. (See 'Automation' at https://yetus.apache.org/documentation/in-progress/precommit/#optional-plug-ins for the current list.) The different CIs run nearly the same pipeline. Anecdotally, Travis is generally the worst performing and most unreliable out of all of them, on some days by a fairly large factor, to the point that I've thought about raising a PR to remove support. So no, it isn't just Creadur. Travis has also been wildly unpredictable with changes. (e.g., limits on log sizes were just introduced in the past year or so; I seem to recall a lower memory limit being added, etc.) If these failures are recent, a newly introduced limit might be what's triggering them. But honestly: unless the project _really_ needs Travis, I'd recommend migrating off of it. While it sits somewhere in between Jenkins and Github Actions on the complexity scale, one is probably better off either dumbing down the build for GHA or going full bore into Jenkins for the heavier needs. (full disclosure: I haven't kept up with the ASF jenkins config since I run my own instance for Yetus testing, but I'm assuming it is still more stable than Travis given there has been little squawking on builds@ lately. Haha.)
Re: problem using maven to gpg-sign to and upload release artifacts to the nexus repository
> On Jun 9, 2022, at 7:51 AM, Rick Hillegas wrote: > > Thanks for the quick response, Maxim. > > Yes, my credentials are in ~/.m2/settings.xml. Maven is able to upload > artifacts and checksums, so the credentials are good. It's the gpg-signing > bit that's broken. > > No, I can't ssh to repository.apache.org: > > mainline (17) > ssh rhille...@repository.apache.org > rhille...@repository.apache.org: Permission denied (publickey) Last I checked, some versions of the maven-gpg-plugin (which is what gets called under the hood, IIRC) need to be told which type of gpg you are using. So you might need to configure it for your particular build environment. For Apache Yetus, we set up a profile to do that: https://github.com/apache/yetus/blob/fae6b390b06c0f1752fc15221cea4b9cdb7d44dc/pom.xml#L264
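As a rough illustration (not the exact Yetus profile; see the link above for the real one), such a profile can be as small as pinning which gpg binary the plugin invokes via maven-gpg-plugin's `gpg.executable` property:

```xml
<!-- Hypothetical sketch: a build profile telling maven-gpg-plugin which
     gpg binary to call. Activate with -Pgpg2 or an activation rule. -->
<profile>
  <id>gpg2</id>
  <properties>
    <gpg.executable>gpg2</gpg.executable>
  </properties>
</profile>
```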
Jenkins patch file processing via JIRA
Just a quick check: Is anyone still using Jenkins to test patches attached to JIRA issues (usually via the Precommit-admin job)? If we were to kill it from Apache Yetus, would that cause anyone heartache? (I don’t have access to the filter that is being used so no idea who is even signed up to use it.) Thanks.
Re: ephemeral builds via AWS ECS and/or EKS? GPU Nodes?
> On Dec 30, 2021, at 10:58 AM, Chris Lambertus wrote: > > Hi folks, > > We have some funding to explore providing ephemeral builds via ECS or EKS in > the Amazon ecosystem, but Infra does not have expertise in this area. We > would like to integrate such a service with Jenkins. > > Does anyone have experience with using these services for CI, and would you > be interested in assisting Infra in developing a prototype? > > Additionally, we may be able to provide some build nodes with GPUs. Do we > have projects which could/would make use of GPUs for integration testing? At $DAYJOB, I configured the Amazon EC2 plug-in ( https://plugins.jenkins.io/ec2 ) to do this type of thing using spot instances with labels tied to the particular EC2 node type that our jobs use. I avoided using the EC2 Fleet plug-in ( https://plugins.jenkins.io/ec2-fleet ) mainly because it always seemed to keep at least one node running, which is not really what you want if you're trying to get the most bang for your buck. In other words, startup time is less important to me than having a node run idle all weekend. Biggest issues we’ve hit with this setup are: a) Depending upon your spot price, you may get outbid and the node gets killed out from underneath you (it rarely happens, but it does happen with our bid) b) You need to know ahead of time what types of nodes you want to allocate and then set a label to match. For the ASF, that might be tricky given that a lot of people have no idea what the actual requirements for their jobs are. c) During a Jenkins restart on rare occasions, the plug-in will ‘lose track’ of allocated nodes. We have limits on how long our allocations will last based on number of runs and idle time, so we can generally spot a ‘stuck’ node after a day or so. I haven’t tried configuring it to use EKS because none of our stuff needs k8s yet.
Re: Github Token Permissions
> On Dec 25, 2021, at 3:42 AM, Gavin McDonald wrote: > > Hi > > On Sat, Dec 25, 2021 at 12:24 PM Gavin McDonald wrote: > I'll take a look, note that Infra has not changed anything, so we can rule > that out as a possible cause. > > I see the last two builds failed the test-patch step, but doesnt say why. > Can you let me know how you narrowed the failure down to the built in > GITHUB_TOKEN ? If you look at the raw logs, you’ll see Yetus trying to write a github status and throwing that error. If it could write, it would tell you why the job failed. Looking at a working vs. not working job setup, it is clear the token permissions have changed from write to read. At this point, I’m just going to assume that we’ll need to code around this change. :(. Not sure how we’ll do that, but…
Re: Github Token Permissions
The one that actually uses Apache Yetus to test Apache Yetus: https://github.com/apache/yetus/blob/main/.github/workflows/yetus.yml "ERROR: Failed to write github status. Token expired or missing repo:status write?" It was working fine a bit over 2 weeks ago and now it isn’t. I forgot that the ’Set up job’ section actually shows the permissions of the token. Comparing working vs. not-working, it is pretty obvious something has changed. (Given what Apache Yetus does, this functionality is _very_ critical…) > On Dec 24, 2021, at 12:29 AM, Gavin McDonald wrote: > > Hi Allen, > > Which workflow please? > > On Fri, Dec 24, 2021 at 2:59 AM Allen Wittenauer wrote: > >> >> >> Did something change with ASF github token permissions? It would appear >> one of our workflows can no longer write Statuses. (I haven’t checked if >> Checks still work or not.) > > > > -- > > *Gavin McDonald* > Systems Administrator > ASF Infrastructure Team
Github Token Permissions
Did something change with ASF github token permissions? It would appear one of our workflows can no longer write Statuses. (I haven’t checked if Checks still work or not.)
Re: Pushing Docker Images
> On Nov 18, 2021, at 1:27 PM, Chris Lambertus wrote: > > x86_64 docker on M1 is going to be running under rosetta2 emulation mode (i > didn't even know you could do that,) and would potentially be considerably > slower than native x86_64 hardware.. The results would likely be different if > you were performing this natively on AARCH64... I'm not sure what you meant > about building both amd64 and arm64, are you running an arm64 cross compiler > on an amd64 emulated docker image on an M1? > Yup. Docker’s buildx framework allows you to build multiple architecture images in an emulation mode simultaneously to avoid all the craziness of using manifests to publish the same tag with different architectures attached. More details here: https://docs.docker.com/buildx/working-with-buildx/ On Linux, it uses qemu (as above). On Docker Desktop for Mac… I’m honestly not sure what it is doing, but, I’d like to think it is using Rosetta 2 + secret sauce but it was so slow that I’m actually wondering if it doesn’t run qemu-x86 in the VM. :/ I need to spend more time playing with it to see what is going on under the hood. For the version of the Yetus containers sitting in ghcr.io, it was built using a single GitHub runner + qemu via GitHub Actions. You can see * the log of the run here: https://github.com/apache/yetus/actions/runs/1476885666 (warning: it is big so use raw mode) * the workflow here: https://github.com/apache/yetus/blob/main/.github/workflows/ghcr.yml * the raw docker commands here: https://github.com/apache/yetus/blob/main/hooks/build (Because it is in hooks/build, if Docker Hub ever fixes their stuff, Yetus will automatically pick it up.)
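The heart of that buildx approach is a single `docker buildx build` invocation listing every target platform, which replaces the manual manifest juggling. A tiny sketch of how such a command might be composed (the image tag and platform list here are illustrative, not Yetus' actual values):

```shell
# Hypothetical helper: compose the multi-arch buildx command described above.
# First argument is the image tag; remaining arguments are target platforms.
buildx_cmd() {
  local tag="$1"; shift
  local platforms
  platforms=$(IFS=,; echo "$*")   # join the platform args with commas
  echo "docker buildx build --platform ${platforms} --tag ${tag} --push ."
}
```

With qemu/binfmt registered on the daemon, the emitted command builds and pushes one multi-arch tag in a single shot instead of publishing per-arch images and stitching manifests by hand.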
Re: Pushing Docker Images
PR was merged. If anyone is curious what multi-arch repo looks like under GHCR: https://github.com/apache/yetus/pkgs/container/yetus https://github.com/apache/yetus/pkgs/container/yetus-base Thanks.
Re: Pushing Docker Images
> On Nov 17, 2021, at 4:17 AM, Martin Grigorov wrote: > >>- In my trials this morning, building both amd64 and arm64 took >> ~1h. That’s at least better than my M1 Max MBP which never completed after >> several hours. >> > > Did you just say that x86_64+QEMU was faster than M1 Max ?! > I wouldn't believe it even if I see it with my eyes! :-) Haha that really does read like that doesn’t it. :D My hunch is that I need to give Docker more memory than my usual 4GB and it will complete. I just need to find time to try it out.
Re: Pushing Docker Images
> On Nov 16, 2021, at 2:34 AM, Martin Grigorov wrote: > > Hi Allen, > > I've just documented how one could use Oracle Cloud free plan to build and > test on Linux ARM64 for free! > Please check > https://martin-grigorov.medium.com/github-actions-arm64-runner-on-oracle-cloud-a77cdf7a325a > and see whether it could be helpful for your case! > You will need Apache Infra team's help for the token needed by ./config.sh > and to setup the security/approvals. Everything else could be done by you > and any member of your team! > > Feedback is welcome! Thanks! I’ll take a look at that. Ironically, I just opened a PR for multi-arch docker builds for Apache Yetus this morning via qemu on GitHub Actions. Yetus has the huge benefit of already having the bits in place to build on Docker hub so that was easily re-used. https://github.com/apache/yetus/pull/239 For those curious: - The Apache Yetus image is _huge_ (lots of tooling…) so takes a while to build anyway. - In my trials this morning, building both amd64 and arm64 took ~1h. That’s at least better than my M1 Max MBP which never completed after several hours. Next thing will likely be to figure out how to mirror or maybe we’ll just kill the apache/yetus images on Dockerhub, depending upon how things work out. 🤷‍♂️
Pushing Docker Images
Hi. For those that build multi-arch images, what process are people using to push images to Docker Hub? We’ve been using the automated builder but it doesn’t appear to support arm64. I’m debating moving the builder … somewhere… and then pushing multi-arch that way. Thoughts? Thanks.
Re: GA again unreasonably slow (again)
> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk wrote: > >> I'm not convinced this is true. I have yet to see any of my PRs for > "non-big" projects getting queued while Spark, Airflow, others are. Thus > why I think there are only a handful of projects that are getting upset > about this but the rest of us are like "meh whatever." > > Do you have any data on that? Or is it just anecdotal evidence? Totally anecdotal. Like when I literally ran a Yetus PR during the builds meeting as you were complaining about Airflow having an X deep queue. My PR ran fine, no pause. > You can see some analysis and actually even charts here: > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status Yes, and I don't even see Yetus showing up. I wonder how many other projects are getting dropped from the dataset. > Maybe you have a very tiny "PR traffic" and it is mostly in the time zone > that is not affected? True, it has very tiny PR traffic right now. (Sep/Oct/Nov was different though) But if it was one big FIFO queue, our PR jobs would also get queued. They aren't, even when I go look at one of the other projects that does have queued jobs. When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens. All I'm saying is: right now, that document feels like it is _greatly_ overstating the problem and, now that you point it out, clearly dropping data. It is a problem, to be sure, but not all GitHub Actions projects are suffering. (I wouldn't be surprised if smaller projects are actually fast tracked through the build queue in order to avoid a tyranny of the majority/resource starvation problem... which would be ironic given how much of an issue that is at the ASF.)
Re: GA again unreasonably slow (again)
> On Feb 7, 2021, at 4:44 PM, Jarek Potiuk wrote: > > If you are interested - my document is here. Open for comments - happy to > add you as editors if you want (just send me your gmail address in priv). > It is rather crude, I had no time to put a bit more effort into it due to > some significant changes in my company, but it should be easy to compare > the values and see the actual improvements we can get. There are likely a > few shortcuts there and some of the numbers are "back-of-the-envelope" and > we are going to validate them even more when we implement all the > optimisations, but the conclusions should be pretty sound. > > https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit# "For Apache projects, starting December 2020 we are experiencing a high strain of GitHub Actions jobs. All Apache projects are sharing 180 jobs and as more projects are using GitHub Actions the job queue becomes a serious bottleneck. " I'm not convinced this is true. I have yet to see any of my PRs for "non-big" projects getting queued while Spark, Airflow, others are. Thus why I think there are only a handful of projects that are getting upset about this but the rest of us are like "meh whatever."
Re: Failure with Github Actions from outside of the organization (out of a sudden!)
> On Dec 27, 2020, at 7:53 AM, kezhenxu94@apache wrote: > We (SkyWalking community) are also building some useful tools that may > benefit more ASF projects, such as a license audit tool > (http://github.com/apache/skywalking-eyes) that I believe most projects will > need it In case you weren't aware, https://creadur.apache.org/rat/ already does license auditing. I think most projects are probably using it at this point.
Re: GitHub Actions Concurrency Limits for Apache projects
> On Dec 20, 2020, at 5:20 PM, Michael A. Smith wrote: > > The Apache Avro project is looking at switching from a TravisCI/Yetus > megabuild to GitHub Actions. If you plan on moving the Yetus portion over to using Yetus' Github Action ( https://yetus.apache.org/documentation/0.13.0/precommit/robots/githubactions/ ), it should primarily be copying/moving the personality file to .yetus/personality.sh (it will get picked up there automatically) and setting up the workflow file. The rest should "just work." If it doesn't, let us know!
Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub
> On Oct 29, 2020, at 8:37 AM, Allen Wittenauer > wrote: > > > >> On Oct 28, 2020, at 11:57 PM, Chris Lambertus wrote: >> >> Infra would LOVE a smarter way to clean the cache. We have to use a heavy >> hammer because there are 300+ projects that want a piece of it, and who >> don’t clean up.. We are not build engineers, so we rely on the community to >> advise us in dealing with the challenges we face. I would be very happy to >> work with you on tooling to improve the cleanup if it improves the >> experience for all projects. > > I'll work on YETUS-1063 so that things make more sense. But in short, > Yetus' "docker-cleanup --sentinel" will purge container images if they are > older than a week, then kill stuck containers after 24 hours. That order > prevents running jobs from getting into trouble. But it also means that in > some cases it doesn't look very clean until two or three days later. But > that's ok: it is important to remember that an empty cache is a useless > cache. Those values came from experiences with Hadoop and HBase, but we can > certainly add some way to tune them. Oh, and unlike the docker tools, it > pretty much ignores labels. It does _not_ do anything with volumes, probably > something we need to add. Docs updated! Relevant pages: - http://yetus.apache.org/documentation/in-progress/precommit/docker-cleanup/ - http://yetus.apache.org/documentation/in-progress/precommit/docker/ Let me know if something doesn't make sense. Thanks!
Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub
> On Oct 29, 2020, at 9:21 AM, Joan Touzet wrote: > > (Sidebar about the script's details) Sure. > I tried to read the shell script, but I'm not in the headspace to fully parse > it at the moment. If I'm understanding correctly, this will still catch > CouchDB's CI docker images if they haven't changed in a week, which happens > often enough, negating the cache. Correct. We actually tried something similar for a while and discovered that in a lot of cases, upstream packages would disappear (or worse, have security problems), making it look like the image is still "good" when it's not. So a weekly rebuild at least guarantees some level of "yup, still good" without having too much of a negative impact. > As a project, we're kind of stuck between a rock and a hard place. We want to > force a docker pull on the base CI image if it's out of date or the image is > corrupted. Otherwise we want to cache forever, not just for a week. I can > probably manage the "do we need to re-pull?" bit with some clever CI > scripting (check for the latest image hash locally, validate the local image, > pull if either fails) but I don't understand how the script resolves the > latter. Most projects that use Yetus for their actual CI testing build the image used for the CI as part of the CI. It is a multi-stage, multi-file docker build that has each run use a 'base' Dockerfile (provided by the project) that rarely changes and a per-run file that Yetus generates on the fly, with both images tagged by either git sha or branch (depending upon context). Due to how docker reference counts the image layers, this makes the docker images effectively act as a "rolling cache" and (beyond a potential weekly cache removal) full builds are rare, making runs relatively cheap (typically <1m runtime) unless the base image had a change far up the chain (so structure wisely). Of course, this also tests the actual image of the CI build as part of the CI. (A "what tests the testers?" philosophy.) Given that Jenkins tries really hard to have job affinity, re-runs were still cheap after the initial one. [Of course, now that the cache is getting nuked every day] Actually, looking at some of the ci-hadoop jobs, it looks like yetus is managing the cache on them. I'm seeing individual run containers from days ago at least. So that's a good sign. > Can a exemption list be passed to the script so that images matching a > certain regex are excluded? You say the script ignores labels entirely, so > perhaps not... Patches accepted. ;) FWIW, I've been testing on my local machine for unrelated reasons and I keep blowing away running containers I care about, so I might end up adding it myself. That said: the code was specifically built for CI systems, where the expectation should be that nothing is permanent.
Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub
> On Oct 28, 2020, at 11:57 PM, Chris Lambertus wrote: > > Infra would LOVE a smarter way to clean the cache. We have to use a heavy > hammer because there are 300+ projects that want a piece of it, and who don’t > clean up.. We are not build engineers, so we rely on the community to advise > us in dealing with the challenges we face. I would be very happy to work with > you on tooling to improve the cleanup if it improves the experience for all > projects. I'll work on YETUS-1063 so that things make more sense. But in short, Yetus' "docker-cleanup --sentinel" will purge container images if they are older than a week, then kill stuck containers after 24 hours. That order prevents running jobs from getting into trouble. But it also means that in some cases it doesn't look very clean until two or three days later. But that's ok: it is important to remember that an empty cache is a useless cache. Those values came from experiences with Hadoop and HBase, but we can certainly add some way to tune them. Oh, and unlike the docker tools, it pretty much ignores labels. It does _not_ do anything with volumes, probably something we need to add.
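The ordering described above (purge images after a week, kill stuck containers after a day) can be sketched as a simple policy function. The names and thresholds below just mirror the prose; this is not the actual Yetus docker-cleanup code:

```shell
# Hypothetical sketch of the sentinel policy: given a resource kind and its
# age in hours, decide what the cleanup pass should do with it.
sentinel_action() {
  local kind="$1" age_hours="$2"
  if [ "$kind" = "image" ] && [ "$age_hours" -ge 168 ]; then
    echo "purge"          # container images older than a week
  elif [ "$kind" = "container" ] && [ "$age_hours" -ge 24 ]; then
    echo "kill"           # stuck containers older than a day
  else
    echo "keep"           # anything younger stays cached
  fi
}
```

Checking images before containers matches the ordering the text calls out: a running job's image stays referenced, so live work doesn't get yanked mid-run.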
Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub
> On Oct 28, 2020, at 9:01 PM, Joan Touzet wrote: > > Even for those of us lucky enough to have sponsorship for dedicated CI > workers, it's still a problem. Infra has scripts to wipe all > not-currently-in-use Docker containers off of each machine every 24 > hours (or did, last I looked). Argh. I really hope this isn't happening again, at least on the machines where Apache Yetus' test-patch runs regularly. It can manage the local cache just fine (which is why, after we implemented the docker cache cleanup code, the Hadoop nodes rarely if ever had docker space problems...). I did separate that part of the code out, so if infra wants a _smarter_ way to clean the cache on nodes where test-patch and friends aren't getting used, the docker-cleanup utility from Yetus is an option. (Although, to be fair, that utility is poorly documented. Maybe I'll work on that this week if there is interest.) > > 2. Infra provides their own Docker registry. Projects that need images > can host them there. These will be automatically exempt. Infra will have > to plan for sufficient storage (this will get big *fast*) and bandwidth > (same). They will also have to firewall it off from non-Apache projects. Given that Apache Yetus is about to launch a Github Action to the marketplace that uses docker, I've been thinking more and more about pursuing the ASF's access to GitHub's registry due to all of the fallout from Docker, Inc.'s flailing. Needless to say, firewalls aren't an option for what I'm needing.
Re: GitHub Actions Concurrency Limits for Apache projects
> On Oct 13, 2020, at 11:04 PM, Jarek Potiuk wrote: > This is a logic > that we have to implement regardless - whether we use yatus or pre-commit > (please correct me if I am wrong). I'm not sure about yatus, but for yetus, for the most part, yes, one would likely need to implement custom rules in the personality to exactly duplicate the overly complicated and over-engineered Airflow setup. The big difference is that one wouldn't be starting from scratch. The difference engine is already there. The file filter is already there. Full build vs. PR handling is already there. Etc., etc. > For all others, this is not a big issue because in total all other > pre-commits take 2-3 minutes at best. And if we find that we need to > optimize it further we can simply disable the '--all-files' switch for > pre-commit and they will only run on the latest commit-changed files > (pre-commit will only run the tests related to those changed files). But > since they are pretty fast (except pylint/mypy/flake8) we think running > them all, for now, is not a problem. That's what everyone thinks until they start aggregating the time across all changes...
Re: GitHub Actions Concurrency Limits for Apache projects
> On Oct 13, 2020, at 11:46 AM, Jarek Potiuk wrote: > > I am rather interested in how those kinds of cases might be handled better > by Yetus - i.e. how much smarter it can be when selecting which parts of > the tests should be run - and how you would define such relation. What > pre-commit is doing is rather straightforward (run tests on files that > changed), what I did in tests takes into account the "structure" of the > project and acts accordingly. And those are rather simple to implement. As > you'd see in my PR it's merely <100 lines in bash to find which files have > changed and based on some predefined rules select which tests to run. I'd > be really interested though if Yates can provide some better ways of > handling it? I think you are misunderstanding where Yetus sits in the stack. And I also misunderstood where you were running pre-commit; it's clear you aren't running it _also_ as part of the CI, just as part of the developer experience. (Which also means there is an assumption that every PR _has_ run those tools ...) Yetus' precommit is probably better thought of as a pre-merge utility. The functionality pre-dates git and commit hooks so... Anyway, let's look at Airflow. For example, Airflow's CI includes static-checks-pylint (https://github.com/apache/airflow/blob/master/.github/workflows/ci.yml). This runs on _every_ PR. Let's look at https://github.com/apache/airflow/pull/11518. This is a markdown update. There is no python code. Yet: https://github.com/apache/airflow/runs/1250824427?check_suite_focus=true 16 minutes blown on an absolutely useless check. And that's just pylint. Look at the entire CI Build workflow: https://github.com/apache/airflow/actions/runs/305454304 ~45 minutes for likely zero value. Spell check might be the only thing that happened that was actually useful. That's 45 minutes that something else could have been executing. I haven't looked at the other workflows that also ran, but probably more wasted time.
Under Yetus, test-patch would have detected this PR was markdown, run markdownlint, blanks, and probably a few other things, and reported the results. It probably would have taken _maybe_ 2 minutes, most of that spent dealing with the docker image. Hooray! 43 minutes back in the executor pool for either more Airflow work or for another project! Is this an extreme example? Sure. But if you stretch these types of cuts over a large number of PRs, it makes a huge, huge difference. Airflow is 'advanced' enough in its CI that using test-patch to cut back on _all_ of these workflows is certainly possible using custom plug-ins. But it might be easier to use smart-apply-patch's --changedfilesreport option to generate a list of files in the PR and then short-circuit workflows based upon that list of file changes. (which reminds me, we need to update the smart-apply-patch docs to cover this haha) === In the specific case of testing, test-patch will slice and dice tests based upon the build tool. So if you are using, say, maven, it will only run the unit tests for that module. There is nothing an end user needs to do. No classifications or anything like that. They get that functionality for free. Unfortunately, Airflow doesn't use a build tool that Yetus supports, so it wouldn't work out of the box, but it could be shoe-horned in by supplying a custom build-tool. Probably not worth the effort in this specific use case, frankly.
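To sketch the short-circuit idea: once the changed-file list exists, a few lines of shell can decide whether the heavy workflows need to run at all. (The changed-files.txt name and the docs-only rule below are assumptions for the demo, not anything --changedfilesreport mandates.)

```shell
#!/usr/bin/env bash
# Pretend smart-apply-patch's --changedfilesreport wrote this list;
# the filename is an assumption for the demo.
printf 'README.md\ndocs/howto.md\n' > changed-files.txt

# If any changed file is NOT documentation, the heavy workflows must run.
if grep -qvE '\.(md|txt)$' changed-files.txt; then
  echo "run-full-ci=true"
else
  echo "run-full-ci=false"
fi
```

For the markdown-only sample above this prints run-full-ci=false, so every expensive workflow can be skipped before it ever grabs an executor.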
Re: GitHub Actions Concurrency Limits for Apache projects
> On Oct 13, 2020, at 9:02 AM, Jarek Potiuk wrote: > > Yep having pre-commits is cool and we extensively use it as part of our > setup in Airflow. Since we are heavily Pythonic project we are using the > fantastic https://pre-commit.com/ framework. Is pre-commit still "dumb"? i.e., does it treat PRs and branches the same? Because Yetus doesn't: it gives targeted advice based upon the change, which makes it faster during the PR cycle. And the bigger the project, the bigger the speedup.
Re: GitHub Actions Concurrency Limits for Apache projects
> On Oct 13, 2020, at 5:00 AM, Jarek Potiuk wrote: > Who else is using GitHub Actions extensively? It's funny you mention that. A lot of the stress on the ASF Jenkins instance was helped tremendously by deploying Apache Yetus. It might be useful for some users on GitHub Actions to take a look at using our current top of tree to see if it would be helpful for them as well as to help us test it out before we publish to the GitHub Marketplace as part of our next release. (Yetus becoming the first ... and maybe only? ... Apache project to publish there.) Web pages of interest: * Base page of Apache Yetus' testing facilities: http://yetus.apache.org/documentation/in-progress/precommit/ * The specific page about GitHub Actions: http://yetus.apache.org/documentation/in-progress/precommit/robots/githubactions/ Note that the _default_ is pure static linting to make the learning curve a bit easier. To get actual builds, ASF license checks, and a few other things turned on, you'll need to set the build tool and likely provide the list of plug-ins one wants to turn on. The list of plug-ins available is on the bottom of that first page. If you have any questions, let us know! Thanks.
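To make that concrete, here's a rough sketch of a GitHub Actions workflow driving test-patch. The image tag, plugin list, and build-tool value are all illustrative assumptions, not canonical settings; the robots/githubactions page above is the authority.

```yaml
# Hedged sketch only: image tag and option values are assumptions.
name: precommit
on: [pull_request]
jobs:
  yetus:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: run test-patch
        run: >
          docker run --rm -v "${GITHUB_WORKSPACE}:/src"
          apache/yetus:main
          test-patch --basedir=/src --patch-dir=/src/yetus-out
          --plugins=all --build-tool=maven
```

Leaving off --plugins and --build-tool gets you the static-linting default described above.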
Re: [PROPOSAL] - Change the Descriptions to installed packages on Jenkins
> On Sep 23, 2020, at 10:51 AM, Gavin McDonald wrote: > > The above list would then become:- > > JDK_1.8_latest > JDK_16_latest > JDK_1.7.0_79_unlimited security > Thoughts please. Unless there is some real *strong* objection with > technical reasons then I intend to make this change in a week or two. This change doesn't impact me in any way/shape/form (hooray container images... never mind that we're not using ASF Jenkins anymore), but if I'm allowed to bike shed for a moment, I'd recommend including _where_ the JDKs come from in the label. OpenJDK vs. Oracle JDK vs. Azul vs. whatever all tend to be slightly different. It might save some time later since this change will break the world anyway.
Re: Controlling the images used for the builds/releases
> On Sep 13, 2020, at 2:55 PM, Joan Touzet wrote: >> I think that any release of ASF software must have corresponding sources >> that can be use to generate those from. Even if there are some binary >> files, those too should be generated from some kind of sources or >> "officially released" binaries that come from some sources. I'd love to get >> some more concrete examples of where it is not possible. > > Sure, this is totally possible. I'm just saying that the amount of source is > extreme in the case where you're talking about a desktop app that runs in > Java or Electron (Chrome as a desktop app), as two examples. ... and mostly impossible when talking about Windows containers.
Re: Controlling the images used for the builds/releases
> On Jun 22, 2020, at 6:52 AM, Jarek Potiuk wrote: > 1) Is this acceptable to have a non-officially released image as a > dependency in released code for the ASF project? My understanding is that the bigger problem is the license of the dependency (and their dependencies) rather than the official/unofficial status. For Apache Yetus' test-patch functionality, we defaulted all of our plugins to off because we couldn't depend upon GPL'd binaries being available or risk giving the impression that they were required. By doing so, it put the onus on the user to specifically enable features that depend upon GPL'd functionality. It also pretty much nukes any idea of being user friendly. :( > 2) If it's not - how do we determine which images are "officially > maintained". Keep in mind that Docker themselves brand their images as 'official' when they actually come from Docker instead of the organizations that own that particular piece of software. It just adds to the complexity. > 3) If yes - how do we put the boundary - when image is acceptable? Are > there any criteria we can use or/ constraints we can put on the > licences/organizations releasing the images we want to make dependencies > for released code of ours? License means everything. > 4) If some images are not acceptable, shoud we bring them in and release > them in a community-managed registry? For the Apache Yetus docker image, we're including everything that the project supports. *shrugs*
Re: broken builds taking up resources
> On Jan 28, 2020, at 8:02 PM, Chris Lambertus wrote: > > > Allen, can you elaborate on what a “proper” implementation is? As far as I > know, this is baked into jenkins. We could raise process limits for the > jenkins user, but these situations only tend to arise when a build has gone > off the rails. > You are correct: the limitations come from the implementation of the jenkins slave jar. Ideally it would run the slave.jar as one user and executors as one or more users. Or at least use cgroups on Linux and RBAC on Solaris and jails on FreeBSD and ... to do a minimal amount of work to protect itself. Instead, it depends upon the good will of spawned processes to not shoot it or anything else running on the box. This works great for the absolutely simple case, but completely falls apart for anything beyond running a handful of shell commands. Thus why I consider it idiotic. There are ways Jenkins could have done some work to prevent this situation from occurring, but alas that is not the case. Yes, it would require more setup of the client, but for those places that need it (i.e., most) it would have been worth it. Instead, on-prem operators are pretty much forced to build a ton of complex machinery to prevent users from wreaking havoc. [1] Or give up and either move to Jenkins talking to cloud or dump Jenkins entirely. [1] - The best on-prem solution I came up with (before I moved my $DAYJOB stuff to cloud) was to run each executor in a VM on the box. That VM would also have a regularly scheduled job that would cause it to wipe itself and respawn via a trigger mechanism. Yeah, completely sucks, but at least it affords a lot more safety.
Re: broken builds taking up resources
> On Jan 27, 2020, at 10:52 PM, Allen Wittenauer > wrote: > > This is almost always because whatever is running on the two executors > have suffocated the system resources. ... and before I forget, a reminder: Java threads take up a file descriptor. Hadoop's unit tests were firing up 10s of thousands of threads which were eating up 10s of thousands of FDs and ultimately led to "cannot fork, no resource" errors causing everything to come tumbling down for the Jenkins slave process. So _all_ the resources, not just RAM or whatever.
Re: broken builds taking up resources
> On Jan 27, 2020, at 6:37 PM, Andriy Redko wrote: > > Thanks a lot for looking into it. From the CXF perspective, I have seen that > many CXF builds have been aborted > because of the connection with master is lost (don't have exact builds to > point since we keep only last 3), > that could probably explain the hanging builds. This is almost always because whatever is running on the two executors has suffocated the system resources. This ends up starving the Jenkins slave.jar, thus causing the disconnect. (It's extremely important to understand that Jenkins' implementation here is sort of brain dead: the slave.jar runs as the SAME USER as the jobs being executed. This is an idiotic implementation, but it is what it is.) Anyway, in my experience, if all or most of one type of job are failing and the node appears to have crashed, then there is a good chance that job is the cause. So it would be great if someone could spend the effort to profile the CXF jobs to see what their actual resource consumption is. FWIW, we had this problem with Hadoop, HBase, and others on the 'Hadoop' label nodes. The answer was to:

a) always run our jobs in containers that could be easily killed (freestyle Jenkins jobs that do 'docker run' generally can't be killed, despite what the UI says, because the signal never reaches the container)

b) give those containers resource limits

c) increase the resources that systemd is allowed to give the jenkins user

After doing that, the number of failures on the Hadoop nodes dropped exponentially.
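For (c), the systemd piece, a hedged sketch of what a drop-in raising the jenkins user's limits might look like. The unit name and the numbers are illustrative, not the values actually used on the Hadoop nodes:

```
# /etc/systemd/system/jenkins-agent.service.d/limits.conf  (illustrative path)
[Service]
# allow far more tasks (processes + threads) than the distro default
TasksMax=65536
# each Java thread also holds a file descriptor, so raise that ceiling too
LimitNOFILE=131072
```

Followed by a `systemctl daemon-reload` and an agent restart to pick it up.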
Re: Fair use policy for build agents?
> On Aug 25, 2019, at 9:13 AM, Dave Fisher wrote: > Why was Hadoop invented in the first place? To take long running tests of new > spam filtering algorithms and distribute to multiple computers taking tests > from days to hours to minutes. Well, it was significantly more than that, but ok. > I really think there needs to be a balance between simple integration tests > and full integration. You’re in luck! That’s exactly what happens! Amongst other things, I’ll be talking about how projects like Apache Hadoop, Apache HBase, and more use Apache Yetus to do context sensitive testing at ACNA in a few weeks.
Re: Fair use policy for build agents?
> On Aug 23, 2019, at 2:13 PM, Christofer Dutz > wrote: > > well I agree that we could possibly split up the job into multiple separate > builds. I’d highly highly highly recommend it. Right now, the job effectively has a race condition: a job-level timer based upon the assumption that ALL nodes in the workflow will be available within that timeframe. That’s not feasible long term. > However this makes running the Jenkins Multibranch pipeline plugin quite a > bit more difficult. Looking at the plc4x Jenkinsfile, prior to INFRA creating the ’nexus-deploy’ label and pulling H50 from the Ubuntu label, it wouldn’t have been THAT difficult. e.g., this stage: ``` stage('Deploy') { when { branch 'develop' } // Only the official build nodes have the credentials to deploy setup. agent { node { label 'ubuntu' } } steps { echo 'Deploying' // Clean up the snapshots directory. dir("local-snapshots-dir/") { deleteDir() } // Unstash the previously stashed build results. unstash name: 'plc4x-build-snapshots' // Deploy the artifacts using the wagon-maven-plugin. sh 'mvn -f jenkins.pom -X -P deploy-snapshots wagon:upload' // Clean up the snapshots directory (freeing up more space after deploying). dir("local-snapshots-dir/") { deleteDir() } } } ``` This seems pretty trivially replaced with build (https://jenkins.io/doc/pipeline/steps/pipeline-build-step/#build-build-a-job) and copyartifacts. Just pass the build # as a param between jobs. Since the site section also has the same sort of code and problems, a Jenkins pipeline library may offer code consolidation facilities to make it even easier. > And the thing is, that our setup has been working fine for about 2 years and > we are just recently having these problems. Welp, things change. Lots of project builds break on a regular basis because of policy decisions, the increase in load, infra software changes, etc. Consider it very lucky it’s been 2 years. The big projects get broken on a pretty regular basis. 
(e.g., things like https://s.apache.org/os78x just fall from the sky with no warning. This removal broke GitHub multi branch pipelines as well and many projects I know of haven’t switched. It’s just easier to run Scan every-so-often thus making the load that much worse ...) I should probably mention that many many projects already have their website and deploy steps separated from their testing job. It’s significantly more efficient on a global/community basis. In my experience with Jenkins and other FIFO job deployment systems (as well as going back to your original question): fairness is better achieved when the jobs are faster/smaller because it gives the scheduler more opportunities to spread the load. > So I didn't want to just configure the actual problem away, because I think > with splitting up the into multiple separate > jobs will just Bring other problems and in the end our deploy jobs will then > just still hang for many, many hours. Instead, this is going to last for another x years and then H50 is going to get busy again as everyone moves their deploy step to that node. Worse, it’s going to clog up the Ubuntu label even more because those jobs are going to tie up the OTHER node that their job is associated with while the H50 job runs. plc4x at least has the advantage that it’s only breaking itself when it’s occupying the H50 node. As mentioned earlier, the ‘websites’ stage has the same issue and will likely be the first to break since there are other projects that are already using that label.
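For what it's worth, the build-step hand-off suggested earlier could look roughly like this. The job name and parameter are made up for illustration:

```
stage('Trigger Deploy') {
    when { branch 'develop' }
    steps {
        // Queue the separate deploy job and free this executor immediately;
        // the deploy job waits for a deploy-capable node on its own.
        build job: 'plc4x-deploy',
              parameters: [string(name: 'SOURCE_BUILD', value: "${env.BUILD_NUMBER}")],
              wait: false
    }
}
```

The deploy job would then use copyartifact with a specific-build selector keyed off SOURCE_BUILD to fetch the artifacts from that exact run.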
Re: Fair use policy for build agents?
> On Aug 23, 2019, at 9:44 AM, Gavin McDonald wrote: > The issue is, and I have seen this multiple times over the last few weeks, > is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase > flaky tests and similar are running on multiple nodes at the same time. The precommit jobs are exercising potential patches/PRs… of course there are going to be multiples running on different nodes simultaneously. That’s how CI systems work. > It > seems that one PR or 1 commit is triggering a job or jobs that split into > part jobs that run on multiple nodes. Unless there is a misconfiguration (and I haven’t been directly involved with Hadoop in a year+), that’s incorrect. There is just that much traffic on these big projects. To put this in perspective, the last time I did some analysis in March of this year, it works out to be ~10 new JIRAs with patches attached for Hadoop _a day_. (Assuming an equal distribution across the year/month/week/day. Which of course isn’t true. Weekdays are higher, weekends lower.) If there are multiple iterations on those 10, well…. and then there are the PRs... > Just yesterday I saw Hadoop and HBase > taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours. > Some of these jobs that take many hours are triggered on a PR or a commit > that could be something as trivial as a typo. This is unacceptable. The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave the ASF machine resources. (I guess that may have happened before you were part of INFRA.) Also, the job sizes for projects using Yetus are SIGNIFICANTLY reduced: the full test suite is about 20 hours. Big projects are just that, big. > HBase > in particular is a Hadoop related project and should be limiting its jobs > to Hadoop labelled nodes H0-H21, but they are running on any and all nodes. Then you should take that up with the HBase project. 
> It is all too familiar to see one job running on a dozen or more executors, > the build queue is now constantly in the hundreds, despite the fact we have > nearly 100 nodes. This must stop. ’nearly 100 nodes’: but how many of those are dedicated to specific projects? 1/3 of them are just for Cassandra and Beam. Also, take a look at the input on the jobs rather than just looking at the job names. It’s probably also worth pointing out that since INFRA mucked with the GitHub pull request builder settings, they’ve caused a stampeding herd problem. As soon as someone runs scan on the project, ALL of the PRs get triggered at once regardless of if there has been an update to the PR or not. > Meanwhile, Chris informs me his single job to deploy to Nexus has been > waiting in 3 days. It sure sounds like Chris’ job is doing something weird though, given it appears it is switching nodes and such mid-job based upon their description. That’s just begging to starve. === Also, looking at the queue this morning (~11AM EDT), a few observations: * The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open slots. * There are lots of jobs in the queue that don’t support multiple runs. So they are self-starving and the problem lies with the project, not the infrastructure. * A quick pass shows that some of the jobs in the queue are tied to specific nodes or have such a limited set of nodes as possible hosts that _of course_ they are going to get starved out. Again, a project-level problem. * Just looking at the queue size is clearly not going to provide any real data as to what the problems are without also looking into why those jobs are in the queue to begin with.
Re: Fair use policy for build agents?
Something is not adding up here… or I’m not understanding the issue... > On Aug 22, 2019, at 6:41 AM, Christofer Dutz > wrote: > we now had one problem several times, that our build is cancelled because it > is impossible to get an “ubuntu” node for deploying artifacts. > Right now I can see the Jenkins build log being flooded with Hadoop PR jobs. The master build queue will show EVERY job regardless of label and will schedule the first job available for that label in the queue (see below). In fact, the hadoop jobs actually have a dedicated label that most of the other big jobs (are supposed to) run on: https://builds.apache.org/label/Hadoop/ Compare this to: https://builds.apache.org/label/ubuntu/ The nodes between these two are supposed to be distinct. Of course, there are some odd-ball labels out there that have a weird cross-section: https://builds.apache.org/label/xenial/ Anyway ... > On Aug 23, 2019, at 5:22 AM, Christofer Dutz > wrote: > > the problem is that we’re running our jobs on a dedicated node too … Is the job running on a dedicated node or the shared ubuntu label? > So our build runs smoothly: Doing Tests, Integration Tests, Sonar Analysis, > Website generation and then waits to get access to a node that can deploy and > here the job just times-out :-/ The job has multiple steps that runs on multiple nodes? If so, you’re going to have a bad time if you’ve put a timeout for the entire job. That’s just not realistic. If it actually needs to run on multiple nodes, why not just trigger a new job via a pipeline API call (buildJob) that can sit in the queue and take the artifacts from the previously successful run as input? Then it won’t time out. > Am August 22, 2019 10:41:36 AM UTC schrieb Christofer Dutz > : > Would it be possible to enforce some sort of fair-use policy that one project > doesn’t block all the others? Side note: Jenkins default queuing system is fairly primitive: pretty much a node-based FIFO queue w/a smattering of node affinity. 
Node has a free slot, check first job in the queue. Does it have a label or node property that matches? Run it. If not, go to the next job in the queue. It doesn’t really have any sort of real capacity tracking to prevent starvation.
Re: External CI Service Limitations
> On Jul 3, 2019, at 3:15 PM, Joan Touzet wrote: > > I was asking if any of the service platforms provided this. So far, it looks > like no. I was playing around a bit with Drone today because we actually need ARM in $DAYJOB and this convo reminded me that I needed to check it out. So far, I’m a little underwhelmed with the feature set. (No built-in artifacting, no junit output processing, buggy/broken yaml parser, … to be fair, they are relatively new so likely still building these things up) BUT! They do support gitlab and acting as a gitlab ci runner. So theoretically one could do linux/x86, windows/x86, mac os x, and linux/arm off of a combo of gitlab ci + drone.
Re: External CI Service Limitations
> On Jul 3, 2019, at 8:04 AM, Joan Touzet wrote: > > (With my CouchDB release engineer hat on only) > > Anyone know if any of these external services supports platforms other > than amd64/x86_64? AFAIK, all of the commercial SaaS companies who have an "open source can use for free” bit that I have experience with only do x86.[*] Most of them that have external runners do support ‘bring your own non-x86’ machine types. There are also things like OpenLab and the Linux Foundation machines too. (The latter provided PowerPC machine access to ASF Jenkins but I know when I tried to use them years ago they were incredibly unstable and the software install was extremely lacking to the point of being ultimately unusable for the project.) I’ve never worked with pure pay companies in this area, so there are likely companies out there. It might be worthwhile for someone to reach out to one since we might get free access for Exposure Bucks or something. > CouchDB keeps receiving a lot of pressure to build on aarch64, ppc64le > and s390x, which keeps pushing us back to Jenkins CI (ASF or > independent). And if we have to do that, then not much else matters to us. One of the nice things about using a system that supports external runners is that it allows for contributions of CPU time from like minded individuals. I wouldn’t trust them to do anything more than run tests though. * - The big problem here is the lack of cost effective non-x86 machines from cloud providers. Drone, for example, does do ARM since it’s an extension of Packet’s cloud. But I don’t have experience with it so didn’t mention it earlier…. Maybe I’ll bang on it over the long weekend.
Re: External CI Service Limitations
> On Jul 2, 2019, at 11:12 PM, Jeff MAURY wrote: > > Azure pipeline vas the big plus of supporting Linux Windows and macos nodes There are a few that support various combinations of non-Linux. Gitlab CI has been there for a while. Circle CI has had OS X and is in beta with Windows. Cirrus CI has all those plus FreeBSD. etc, etc. It’s quickly becoming required that cloud-based CI systems do more than just throw up a Linux box. > And i think you can add you nodes to the pools I think they are limited to being on Azure tho, IIRC. But I could be misremembering. I pretty much gave up on doing anything serious with it. I really wanted to like pipelines. The UI is nice. But in the end, Pipelines was one of the more frustrating ones to work with in my experience, and that was with some help from the MS folks. It suffers a death by a thousand cuts (lack of complex, real-world examples, custom docker binary, pre-populated bits here and there, a ton of env vars, artifact system is a total disaster, etc, etc). Lots of small problems that add up to just not being worth the effort. Hopefully it’s improved since I last looked at it months and months ago though.
Re: External CI Service Limitations
> On Jul 2, 2019, at 10:21 PM, Greg Stein wrote: > > We'll keep this list apprised of anything we find. If anybody knows of, > and/or can recommend a similar type of outsourced build service ... we > *absolutely* would welcome pointers. FWIW, we’ve been collecting them bit by bit into Apache Yetus ( http://yetus.apache.org/documentation/in-progress/precommit-robots/ ): * Azure Pipelines * Circle CI * Cirrus CI * Gitlab CI * Semaphore CI * Travis CI They all have some pros and cons. I’m not going to rank them or anything. I will say, however, it really feels like Gitlab CI is the best bet to pursue since one can add their own runners to the Gitlab CI infrastructure dedicated to their own projects. That ultimately means that replacing Jenkins slaves is a very real possibility. (Also, I’ve requested access to the Github Actions beta, but haven’t received anything yet. I have a hunch that the reworking of the OAuth permission model is related, which may make some of these more viable for the ASF.)
Re: GitHub PR -> Multi-branch Jenkins pipeline triggering
> On May 10, 2019, at 2:04 PM, Zoltán Nagy wrote: > > In Zipkin (Incubating) we have a multi-branch pipeline building all our > projects: > https://builds.apache.org/view/Z/view/Zipkin/job/GH-incubator-zipkin/ > > Unfortunately pull-requests don't seem to trigger runs (or perhaps rather, > repository scans) on any of the projects. Given that we don't have admin > access to the GH repositories, root-causing this is turning out to be > tricky. I was hoping you might be able help us figure out what we're > missing. Probably the web hook configuration on the GitHub side, which, IIRC, has to be done as an INFRA request. e.g., https://issues.apache.org/jira/browse/INFRA-17471
Re: Jenkins Build for Heron
> On Apr 23, 2019, at 11:50 AM, Josh Fischer wrote: > 1. Does the Jenkins box have the build tools listed below already? Or do > you think it would be better if I downloaded and installed in the workspace > for each build? I’d *highly* recommend using a docker container so that you can control the software versions. Especially with node involved.
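Something along these lines, sketched with made-up versions; swap in whatever tools the Heron build actually needs:

```dockerfile
# Illustrative only: pin exact tool versions instead of relying on
# whatever happens to be installed on the Jenkins node.
FROM ubuntu:18.04
RUN apt-get update && apt-get install -y --no-install-recommends \
        openjdk-8-jdk maven curl ca-certificates
# pin a specific Node.js line rather than using the node's system copy
RUN curl -fsSL https://deb.nodesource.com/setup_12.x | bash - \
    && apt-get install -y nodejs
```

Then every build runs against the same image, and upgrading a tool is a reviewed change to the Dockerfile rather than a surprise on the node.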
Re: PRJenkins builds for Projects
> On Jan 12, 2019, at 7:58 AM, Allen Wittenauer > wrote: > > > For Apache Yetus, we do a few things to circumvent this problem, > including making sure ${HOME} is defined and doing run specific docker images > ( > https://github.com/apache/yetus/blob/master/precommit/src/main/shell/test-patch-docker/Dockerfile.patchspecific > ) based upon a provided docker file or docker tag. Ha. Just noticed there is a bug in that file.
Re: PRJenkins builds for Projects
> On Jan 11, 2019, at 11:23 PM, Dominik Psenner wrote: > > I can enlist another pain point I faced while implementing the pipeline for > log4net. I had to find a way of detecting the uid/guid of the jenkins user > to make it work with dotnet core commandline inside docker. That really got > my head aching and as far as I can remember stemmed from the fact that the > dotnet commandline was unable to the detect the home directory and > attempted to write files into places of the filesystem that it was not > supposed to. I’m guessing you are using maven via the docker agent? In my experience, this is a combination of docker implementation details, missing features in Jenkins, and misleading maven documentation (hint: system property user.home does not always equal $HOME!). There is a not-very-well documented workaround probably because it demonstrates a security weakness in the Jenkins agent+Docker. Just mount /etc/passwd and friends in your container: pipeline { agent { docker { image 'foo' label 'bar' args "-v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v ${HOME}:${HOME}" } } … } Note that for Docker-in-Docker setups, this sometimes completely falls apart (e.g., jnlp and OS X agents) due to the docker daemon running in a different namespace of where the jenkins users is. (OS X’s docker implementation just flat out lies about the universe!) There’s also the issue of conflicting user/group definitions but it is what it is. Luckily, this usually works well enough to workaround executables that don’t honor $HOME variables and instead look at passwd information. For Apache Yetus, we do a few things to circumvent this problem, including making sure ${HOME} is defined and doing run specific docker images ( https://github.com/apache/yetus/blob/master/precommit/src/main/shell/test-patch-docker/Dockerfile.patchspecific ) based upon a provided docker file or docker tag. I wish Jenkins did something similar. 
If it provided a hook for a run-specific Dockerfile that does the necessary magic before launching, so that the Jenkins user could be defined, we’d all be better off. (cgroup+user remapping might also work but I doubt it.)
Re: PRJenkins builds for Projects
> On Jan 10, 2019, at 11:28 PM, Stephen Connolly > wrote: > > On Fri 11 Jan 2019 at 06:28, Joan Touzet wrote: >> >> I'm willing to believe that Jenkins, the software, is incapable of > > > I assume you meant capable rather than incapable. Nope, I agree with Joan: incapable is probably the correct word. I’ve lost track of how many issues I’ve hit just in the past week of missing or broken features. [1] It’s very clear (esp with Blue Ocean and Pipelines) that Cloudbees is trying very hard to push people into a very simplified model of CI. Anything complex either can’t be done or requires so much work that it isn’t practical. [2] >> What about buildbot? Or another technology we could use with INFRA's >> support? Last time I looked at buildbot, its integration with Docker >> was very poor. >> >> I don't have any special attachment to Jenkins. IMO, this is probably something we as a community should look into doing. We’re pushing Jenkins way harder than what it feels like it was designed to do, given how many issues we hit on a regular basis and some of the core limitations of the platform. 1 - My current favorite being JENKINS-17116, which I’ve hit in both freestyle and pipeline jobs to the point that I ended up writing pid handling because I can’t trust Jenkins to actually signal processes properly. 2 - JENKINS-27413… and Glick’s answer just re-affirms in my mind that Jenkins is getting dumbed down: why push this off to a plugin?
Re: Can we package release artifacts on builds.a.o?
> On Jan 7, 2019, at 11:50 AM, Alex Harui wrote: > > I don't understand. Who am I "making" do what work? And why do at least 3 > others want something similar? And what would you propose Royale should do > instead? Always have me cut releases? If their computers are broken, they could always spin up a free micro instance on AWS. Cut the release. Then spin it down. But it really makes me wonder how they are developing and testing their changes locally if building is such a burden ...
Re: PRJenkins builds for Projects
> On Jan 6, 2019, at 10:43 AM, Dominik Psenner wrote: > > On Sun, Jan 6, 2019, 19:32 Allen Wittenauer > >> >> a) The ASF has been running untrusted code since before Github existed. >> From my casual watching of Jenkins, most of the change code we run doesn’t >> come from Github PRs. Any solution absolutely needs to consider what >> happens in a JIRA-based patch file world. [footnote 1,2] >> > > If some project build begins to draw resources in an extraordinary fashion > it will be noticed. Strongly disagree. Yesterday, my cleaner code killed three stuck surefire jobs that had been looping on a handful of cores since sometime in 2018. The sling jobs I noted earlier in the week were eating 20GB of RAM. That’s even before we get into the unit-tests-that-are-really-integration-tests that are coming from the big data projects where gigs of memory and thousands of process slots are consumed on a regular basis.
Re: PRJenkins builds for Projects
a) The ASF has been running untrusted code since before Github existed. From my casual watching of Jenkins, most of the change code we run doesn’t come from Github PRs. Any solution absolutely needs to consider what happens in a JIRA-based patch file world. [footnote 1,2] b) Making everything get reviewed by a committer before executing is a non-starter. For large communities, precommit testing acts as a way for contributors to get feedback prior to a committer even getting involved. This allows for change iteration prior to another human spending time on it. But the secondary effect is that it acts as a funnel: if a project gets thousands of change requests a year [footnote 3], it’s now trivial for committers to focus their energy on the ones that are closest to commit. c) We’ve needed disposable environments (what Stephen Connolly called throwaway hardware and is similar to what Dominik Psenner talked about wrt gitlab runners) for a while. When INFRA enabled multiple executors per node (which they did for good reasons), it triggered an avalanche of problems: maven’s lack of repo locking, noisy neighbors, Jenkins’ problems galore (security and DoS which still exist today!), systemd’s cgroup limitations, and a whole lot more. Getting security out of them is really just extra at this point. 1 - With the forced move to gitbox, this may change, but time will tell. 2 - FWIW: Gavin and I have been playing with Jenkins’ JIRA Trigger Plugin and finding that it’s got some significant weaknesses and needs a lot of support code to make viable. This means we’ll likely be sticking with some form of Yetus’ precommit-admin for a while longer. :( So the bright side here is that at least the ASF owns the code to make it happen. 3 - Some perspective: Hadoop generated ~6500 JIRAs with patch files attached last year alone for the 15 or so active committers to review. 
If half of the issues had the initial patch plus a single iteration, that’s 13,000 patches that got tested on Jenkins.
Re: PR Jenkins builds for Projects
> On Jan 4, 2019, at 1:06 PM, Joan Touzet wrote: > > > - Original Message - >> From: "Allen Wittenauer" > >> This is the same model the ASF has used for JIRA for a decade+. >> It’s always been possible for anyone to submit anything to Jenkins >> and have it get executed. Limiting PRs or patch files in JIRAs to >> just committers is very anti-community. (This is why all this talk >> about using Jenkins for building artifacts I find very >> entertaining. The infrastructure just flat out isn’t built for it >> and absolutely requires disposable environments.) > > Then we build a new, additional Jenkins that is committer-only (or PMC- > only, perhaps, if it's for release purposes). This is a tractable > problem. I think people forget that the ASF is a non-profit for individuals. It’s not a business. It’s not a non-profit that requires its members to be companies willing to pay astronomical fees. People-time is almost all volunteer. As such, time to work on these problems is in *extremely* short supply, never mind the actual hardware, power, etc., costs. That’s not even covering the legal issues... > We are stuck at an impasse where people need something to reduce the > manual workload, and we have an obsolete policy standing in its way. I’m honestly confused as to why running scripts on one server vs. running them on another suddenly makes the release process less manual. > We must be the last organisation in the world where people are forced > to release software through a manual process. lol, no, hardly. How many other non-profits have this much software with so few paid employees running the show? > I don't see why this is something to be gleeful about. Being entertained is not the same thing as being gleeful.
Re: PR Jenkins builds for Projects
> On Jan 4, 2019, at 2:00 AM, Christofer Dutz wrote: > > Hmmm, > > thinking about it ... this is not quite "safe" is it? Just imagining someone > starting PRs with maven download-plugin and exec-plugin starting a bitcoin > miner or worse ... what does Infra think about this? > Would prefer the "everyone" PR builds to run on Travis or something that > wouldn't harm the ASF. This is the same model the ASF has used for JIRA for a decade+. It’s always been possible for anyone to submit anything to Jenkins and have it get executed. Limiting PRs or patch files in JIRAs to just committers is very anti-community. (This is why all this talk about using Jenkins for building artifacts I find very entertaining. The infrastructure just flat out isn’t built for it and absolutely requires disposable environments.)
Re: PR Jenkins builds for Projects
> On Jan 3, 2019, at 7:34 AM, Christofer Dutz wrote: > > Hi Allen, > > thanks for that ... if I had known that simply selecting the "GitHub" as > source instead of the generic "Git" ... would have made things easier ... > however it seems that we have exceeded some sort of API usage limit: Yup. That’s why we set up our own project-specific user to query GitHub and set trust to ‘Everyone’ since our user doesn’t have privs on Github. :/ (See also: last month’s discussion of github’s idiotic permission system.)
Re: Please pick up after yourself
> On Jan 3, 2019, at 7:15 AM, Christofer Dutz wrote: > > Is there a way to check the status of a project? There isn’t any sort of global “bad list”. haha. At least, I know I personally gave up trying to keep track of the strays … > I would like to help improve and have done some things, but I need a way to > see that what I'm doing is helping. Just doing a simple ps -ef | grep “${WORKSPACE}” as a post {} would show if anything is hanging about for maven projects (since they tend to full path everything, and unless your pipeline is doing bizarro things, $WORKSPACE should be specific to your job’s executor).
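A minimal sketch of that check (the helper name and sample usage are mine; a real post step would feed it live `ps` output):

```shell
# Hypothetical post-build stray-process check: print the PIDs of
# processes whose command line mentions this job's workspace, so
# leftovers can be reported and killed.
strays() {
  # $1 = workspace path; reads `ps -ef`-style output on stdin and
  # prints the PID column for matching lines (skipping the grep itself)
  ws="$1"
  grep -F "$ws" | grep -v grep | awk '{print $2}'
}

# In an actual post {} step, something along the lines of:
#   ps -ef | strays "${WORKSPACE}" | xargs -r kill -9
```

Because maven builds tend to use full paths under `$WORKSPACE`, a plain fixed-string match is usually enough to catch surefire leftovers.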
Re: PR Jenkins builds for Projects
> On Jan 3, 2019, at 7:14 AM, Christofer Dutz wrote: > > I can't see that ... where can we find that ... and we don't want to > automatically push everything that works. > > From the description of "Enable Git validated merge support" it would > automatically push everything that passes the build ... that doesn't sound > desirable. > When I look at the "GitHub-Projekt" plugins description ... this doesn't seem > to handle PRs. > > Chris Take a look at https://builds.apache.org/view/S-Z/view/Yetus/job/yetus-github-multibranch/configure . It does not do a push, handles PRs, branches, and forks.
Re: Please pick up after yourself
> On Jan 3, 2019, at 3:11 AM, Bertrand Delacretaz > wrote: > > Hi, > > On Fri, Dec 21, 2018 at 10:53 PM Allen Wittenauer > wrote: > >> ...Culprits: Accumulo, Reef, and Sling. > > Sling has a few hundred modules, if you have more specific info on > which are problematic please let us know so we have a better chance of > fixing that. I gave up and wrote a (relatively simple) preamble for our jobs to shoot any long-running processes that are still hanging out in the workspace directories. Output gets logged in the console log, e.g.:

==
USER    PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME    COMMAND
jenkins 24952 0.0  0.0  3476248  96      ?   Sl   2018  23:32   /home/jenkins/tools/java/latest1.7/bin/java -Xmx512m -Xms256m -Djava.awt.headless=true -XX:MaxPermSize=256m -Xss256k -jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefirebooter8344429480529768484.jar /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire6624482576438364006tmp /home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire_44678967186117353271tmp
Killing 24952
***
USER    PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME    COMMAND
jenkins 53339 0.0  0.4  30068248 462472  ?   Sl   2018  3:23    /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefirebooter4295922957398927030.jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire8873399700577323873tmp /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire_09146430567560271463tmp
Killing 53339
***
USER    PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME    COMMAND
jenkins 53381 1.5  2.4  13640196 2447672 ?   Sl   2018  72:48   /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 42022 -Dsling.run.modes=author,notshared
Killing 53381
***
USER    PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME    COMMAND
jenkins 55854 105  2.4  13638076 2422584 ?   Sl   2018  4967:06 /usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar /home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar -p 38732 -Dsling.run.modes=publish,notshared
Killing 55854
***
===

BTW, I hope people realize that surefire doesn’t actually report all unit test failures. It makes the assumption that a unit test will write an XML file. If the unit test gets stuck, or any number of other things happen, it won’t get reported as a failure. It’s why maven jobs absolutely need to do a post-action to check for these things (and then kill them so they don’t hang around eating resources). Hint: running in a docker container makes this required post-action much more fool-proof. I’m also growing more and more suspicious of some of the tuning on the build nodes. I have a hunch that other systemd bits beyond pid limits need to get changed, since it doesn’t appear that all node resources are actually available to the ‘jenkins’ user. 
But I can’t pin down exactly which ones they are. I do know that since adding the ‘kill processes > 24 hours’ code, my own jobs have failed due to “can’t exec” errors only once.
Please pick up after yourself
I’m now at 4 times this week where my build job has landed on a node that has broken JVM tasks hanging about from surefire tests gone awry. (Culprits: Accumulo, Reef, and Sling.) Due to the way Linux does process limits on systemd-based boxes, even though there is plenty of CPU and memory, my tasks are getting killed because all of these surefire tests have spawned enough threads that everything else fails. Folks: please, if you aren’t running in a docker container (which makes it extremely easy to clean up as well as enforce a sub-5k process limit), please add a Post Action on your Jenkins job to blow away your tasks that are still hanging around. At this point, I feel like I have no choice but to just start nuking any long-running java processes (minus agent/slave.jar and the datadog stuff that infra runs) before startup just so I can get a build. :(
Re: Non committer collaborators on GitHub
> On Dec 14, 2018, at 9:21 AM, Joan Touzet wrote: > > Allen Wittenauer wrote: >> I think part of the basic problem here is that Github’s view of permissions >> is really awful. It is super super dumb that accounts have to have >> admin-level privileges for repos to use the API to do some basic things that >> can otherwise be gleaned by just scraping the user-facing website. If >> anyone from Github is here, I’d love to have a chat. ;) > > FYI I've previously been told we can't use addons to GitHub to improve > the issue management workflow (like https://waffle.io/) precisely > because GitHub's permissions model is so poor, allowing an external > tool to move tickets around requires giving it effectively commit > access, which is forbidden to third parties. Putting my thinking cap on, I wonder if the workaround here is to have a proxy for the REST API that forwards the ’safe’ calls but disallows others. Maybe one already exists? I totally get the security and potentially legal ramifications of having accounts that can push. But it sure seems like this problem is solvable with a bit of elbow grease.
Re: Non committer collaborators on GitHub
> On Dec 14, 2018, at 3:57 AM, Zoran Regvart wrote: > > Hi Builders, > I see some projects like Apache Sling use their own GitHub accounts > via personal access tokens on GitHub. I'm guessing this is a > workaround for not having a non-committer collaborator account that > can be used to update commit status from Jenkins pipelines. > > I too have created an account, I needed one just to bypass the API > limits for anonymous access[1]. But since that account is not a > collaborator on GitHub it cannot update the commit status. I.e. the > end result is: > > Could not update commit status, please check if your scan credentials > belong to a member of the organization or a collaborator of the > repository and repo:status scope is selected > > So one way of fixing this is to use my own GitHub account, which I'm, > understandably hesitant to do. > > Another is to have this non-committer account added as a collaborator, > would this violate any ASF rules? > > And, probably the best one, is to have a ASF wide GitHub account that > builds can use. More or less, +1 . I’m currently going through this whole exercise now. We committed support for Github Branch Source Plug-in (and Github pull request builder) into Apache Yetus and now want to test it. But it’s pretty impossible to do that because the account that we’re using (that’s tied to priv...@yetus.apache.org) doesn’t have enough access permissions to really do much. I do think because of how Github works, an ASF-wide one is probably too dangerous. But I can’t see why private@project accounts couldn’t be added so long as folks don’t do dumb things like auto-push code. There has to be a level of trust here unfortunately though which is why it may not come to fruition. :( Side-rant: I think part of the basic problem here is that Github’s view of permissions is really awful. 
It is super super dumb that accounts have to have admin-level privileges for repos to use the API to do some basic things that can otherwise be gleaned by just scraping the user-facing website. If anyone from Github is here, I’d love to have a chat. ;)
Re: Can we package release artifacts on builds.a.o?
> On Dec 11, 2018, at 9:43 AM, Joan Touzet wrote: > Thanks, Allen. So I am still fighting against the system here. I view it more as tilting at windmills but tomato, tomato. ;) > If binaries are conveniences, and they are not official, we should be able to > auto-push binaries built on trusted infrastructure out to the world. Why > can't that be our (Infra maintained & supported, costly from a non-profit > perspective) CI/CD infrastructure? Frankly: given how much dumb stuff I see happening on the ASF Jenkins servers on a regular basis, I know I wouldn’t trust them as far as I could throw them. [I’m pretty sure those servers are heavy and I’m not very strong, so that wouldn’t be very far. :) ] All it would take is one person firing off a ‘bad' build that then gets signed by a buildbot account and now ALL of the ASF builds signed by that account are suspect. That would be super bad. From a more philosophical perspective, the current model definitely stresses the idea that the ASF is made up of diverse communities that all have their own (relative) governance. The binary artifacts I’ve done for Apache Yetus take a few minutes and look very different than binary artifacts from other projects. Meanwhile, people would scream bloody murder if the artifact build server were tied up for the ~2-3 hours it takes to make Apache Hadoop while it downloads fresh copies of the hundreds of Docker and Apache Maven dependencies required to build. [Because, I mean, you _are_ building _everything_ from scratch when building these, right???]
Re: A little general purpose documentation on using Jenkinsfiles
> On Dec 11, 2018, at 6:44 AM, Christofer Dutz > wrote: > > Hi all, > > As I have been setting up the builds for several Apache projects, I usually > encountered the same problems and used the same solutions. > > Thinking that this might help other projects, I documented recipes for some > of the typical problems I encountered over time: > > https://cwiki.apache.org/confluence/display/INFRA/Multibranch+Pipeline+recipies > > Feel free to comment, add or correct me … maybe things can be done even > better. This is great! I would have loved to have had it when I was starting with this stuff! That said: I think the #1 thing when dealing with multibranch pipelines is that post’s cleanup step should always have a deleteDir() in it to wipe out the workspace, since the workspace directories for multibranch pipelines are hashes. Not doing that means that Jenkins slaves will fill up. :(
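A minimal sketch of that in a declarative Jenkinsfile (the stage and build command are hypothetical; the point is the post/cleanup block):

```groovy
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                sh 'mvn -B clean verify'   // hypothetical build step
            }
        }
    }
    post {
        // Always wipe the (hashed) multibranch workspace, whether the
        // build passed or failed, so slaves don't slowly fill up with
        // abandoned checkouts.
        cleanup {
            deleteDir()
        }
    }
}
```

The `cleanup` condition runs after every other post condition, so artifacts can still be archived before the workspace is removed.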
Re: Can we package release artifacts on builds.a.o?
> On Dec 11, 2018, at 9:09 AM, Joan Touzet wrote: > > Jenkins users are deploying directly to Nexus with builds. > > Isn't that speaking out of both sides of our mouths at the same time, if Java > developers can push release builds directly to Nexus but non-Java developers > can't? > > Perhaps I'm misunderstanding...are the Nexus-published builds not treated the > same because they're not on dist.apache.org? Or are they not release versions? Yes, you are misunderstanding. 1) Officially (legally?), source code distributions are "the release." Any and all binaries are considered to be convenience binaries so users don’t have to compile. They are not official. [Statements like “verify a release by rebuilding” don’t really parse as a result.] 2) As far as I’m aware/all the projects I’ve ever worked with, the uploads to Nexus are to the snapshot repo, not the release repo. The release repos are still done manually.
Re: Can we package release artifacts on builds.a.o?
> On Dec 7, 2018, at 11:56 PM, Alex Harui wrote: > > > > On 12/7/18, 10:49 PM, "Allen Wittenauer" > wrote: > > > >> On Dec 7, 2018, at 10:22 PM, Alex Harui wrote: >> >> Maven's release plugins commit and push to Git and upload to repository.a.o. >> I saw that some folks have a node that can commit to the a.o website SVN. >> Is anyone already doing releases from builds? What issues are there, if any? > > It's just flat out not secure enough to do a release on. > > Can you give me an example of how it isn't secure enough? The primary purpose of these servers is to run untested, unverified code. Jenkins has some very sharp security corners that makes it trivially un-trustable. Something easy to understand: when Jenkins is configured to run multiple builds on a node, all builds on that node run in the same user space. Because there is no separation between executors, it's very possible for anyone to execute something that modifies another running build. For example, probably the biggest bang for the least amount of work would be to replace jars in the shared maven cache. [... and no, Docker doesn't help.] There are other, bigger problems, but I'd rather not put that out in the public.
Re: Can we package release artifacts on builds.a.o?
> On Dec 7, 2018, at 10:22 PM, Alex Harui wrote: > > Maven's release plugins commit and push to Git and upload to repository.a.o. > I saw that some folks have a node that can commit to the a.o website SVN. Is > anyone already doing releases from builds? What issues are there, if any? It's just flat out not secure enough to do a release on.
YETUS-15 has been committed
[BCC: builds@apache.org since this may impact some folks there...] Apologies for the wide distribution... YETUS-15 [https://issues.apache.org/jira/browse/YETUS-15] has been committed. This patch switches Apache Yetus over to a Maven-based build system. This has a few different impacts: * Any existing jobs that are tied specifically to master will almost certainly break. I'm unsure if any of these still exist, with one exception ... * precommit-admin (the job that submits JIRAs to the various precommit-* jobs) has already been fixed to use the new layout. Patch submission via that method should continue uninterrupted. * I'm currently working on getting the Yetus-specific jobs running again. There are quite a few, and testing the testing framework is always tricky. :) * Any patches sitting in the YETUS patch tree will obviously need to be rebased. * There are likely some rough spots in the new build and the new Maven plug-in for providing Yetus functionality. Be sure to play with them, file JIRAs, and even better if patches come with them. :) Thanks! === Current release note for YETUS-15 === Apache Yetus has been converted to use Apache Maven as a build tool. As a result, many changes have taken place that directly impact the project. * Source directories have been re-arranged and re-named: * All bash code is now in (feature)/src/main/shell * All python code is now in (feature)/src/main/python * audience-annotations is mostly unchanged. * releasedocmaker and shelldocs are now available as Jython-built jars. * Introduction of the yetus-minimaven-plugin and yetus-maven-plugin. The yetus-minimaven-plugin is used to build Apache Yetus. yetus-maven-plugin is an end-user artifact that gives access to some Apache Yetus features for Apache Maven and compatible build systems without needing any external help (e.g., yetus-wrapper). * Middleman is still used for creating the static website; however, it is now tied into the 'mvn site' command. 
'mvn install' MUST be executed before running 'mvn site', as website generation depends upon the yetus-minimaven-plugin. * The content of yetus-project is now in the root of the source tree. * The new yetus-dist module handles the creation of a complete distribution. The artifacts are now in the yetus-dist/target directory. The artifact contents are largely unchanged. A new yetus-assemblies module and various Apache Maven configuration files have been added to create distribution parity. * The website is also available as a tar.gz tarball in the yetus-dist artifact area. * The jdiff module is now always built. * Version handling has been modified in several different locations and in the executables themselves. Also, other changes introduced: * start-build-env.sh has been added to create a Docker-ized development environment. In particular, this imports the .ssh and .gnupg directories and has all pre-requisites for building Apache Yetus and making releases. * A Dockerfile in root has been added for hub.docker.com and CI-system integration. * The old Dockerfile (previously located at precommit/test-patch-docker and now located at precommit/src/main/shell/test-patch-docker) has been changed to also be able to run releasedocmaker. * Some ruby dependencies for the website have been updated for security reasons. * JDK8 is now the minimum version of Java used to build the Apache Yetus Java components. * precommit's shellcheck.sh now recognizes src/main/shell as containing shell code to check. * releasedocmaker and shelldocs now explicitly call for python2.
Re: Jenkins Slave Workspace Retention
> On Jul 23, 2018, at 4:35 PM, Gav wrote: > > Thanks Allen, > > Some of our nodes are only 364GB in total size, so you can see that this is > an issue. Ugh. > For the H0-H12 nodes we are pretty fine currently with 2.4/2.6TB disks - > therefore the urgency is on the Hadoop nodes H13 - H18 and the non Hadoop > nodes. > > I propose therefore H0-H12 be trimmed on a monthly basis for mtime +31 in > the workspace and the H13-H18 + the remaining nodes with 500GB disk and less > be done > weekly > > Sounds reasonable ? Disclosure: I’m not really doing much with the Hadoop project anymore, so someone from that community would need to step forward. But If I Were King: For the small nodes in the Hadoop queue, I’d request they either get pulled out or put into ‘Hadoop-small’ or some other similar name. Doing a quick pass over the directory structure via Jenkins, with only one or two outliers, everything there is ‘reasonable’; i.e., 400G drives are just under-spec’ed for the full workload that the ‘Hadoop’ nodes are expected to do these days. 7 days isn’t going to do it. Putting JUST the nightly jobs on them (hadoop qbt, hbase nightly, maybe a handful of other jobs) would eat plenty of disk space. 7 days until the workspace dir goes away is probably reasonable based on the other nodes, though. But it looks to me like there are jobs running on the non-Hadoop nodes that probably should be in the Hadoop queue (Ambari, HBase, Ranger, Zookeeper, probably others). Vice-versa is probably also true. It might also be worthwhile to bug some of the vendors involved to see if they can pony up some machines/cash for build server upgrades like Y!/Oath did/does. That said, I potentially see some changes that the Apache Yetus project can do to lessen the disk space load for those projects that use it. I’ll need to experiment a bit first to be sure. Looking at 10s of G freed up if my hypotheses are correct. 
That might be enough to not move nodes around in the Hadoop queue but I can’t see that lasting long. Jenkins allegedly has the ability to show compressed log files. It might be worthwhile investigating doing something in this space on a global level. Just gzip up every foo.log in workspace dirs after 24 hours or something. One other thing to keep in mind: the modification time on a directory only changes if a direct child of that directory changes. There are likely many jobs that have a directory structure such that the parent workspace directory time is not modified. Any sort of purge job is going to need to be careful not to nuke a directory structure like this that is being used. :) HTH.
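The mtime caveat is easy to demonstrate (a sketch; the paths are throwaway temp dirs):

```shell
# Demonstrates why a naive mtime-based purge can flag a live workspace:
# a directory's mtime only changes when a *direct* child is created,
# removed, or renamed -- writes deeper in the tree leave it untouched.
tmp=$(mktemp -d)
mkdir -p "$tmp/ws/deep"
touch -t 202001010000 "$tmp/ws"        # make the workspace dir look old
echo data > "$tmp/ws/deep/active.log"  # a job is still writing deep inside
# A purge pass like this would still consider 'ws' stale:
stale=$(find "$tmp" -maxdepth 1 -type d -name ws -mtime +31)
echo "would purge: $stale"
rm -rf "$tmp"
```

So a safe purge needs to look at the newest mtime anywhere under the workspace (or check for live processes), not just the top-level directory.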
Re: Jenkins Slave Workspace Retention
> On Jul 23, 2018, at 4:07 PM, Gav wrote: > > You are trading latency for disk space here. For the builds I’m aware of, without a doubt. But that’s not necessarily true for all jobs. As Jason Kuster pointed out, in some cases one may be trading reliability for disk space. (But I guess that reliability depends upon the node. :) ) > Just how long are you proposing that workspaces be kept for - considering > that the non hadoop nodes are running out of disk every day and workspaces > of projects are exceeding 300GB in size, that seems totally over the top in > order to keep a local cache around to save a bit of time. 300GB spread across how many jobs though? All of them? If 300 jobs are using 1G each, that sounds amazingly good, given that the git repos alone may eat that much space on super-active ones. If it’s a single workspace hitting those numbers, then yes, that’s problematic. Have you tried talking to the owners of the bigger space hogs individually? I’d be greatly surprised if the majority of people relying upon Jenkins actually read builds@. They are likely unaware their stuff is breaking the universe.
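For finding those hogs, something along these lines works (a sketch; the helper name is mine and the workspace root varies per node):

```shell
# Print the largest job workspace directories under a given root,
# biggest first, so the worst offenders can be contacted individually.
top_hogs() {
  # $1 = workspace root, e.g. /home/jenkins/jenkins-slave/workspace
  du -sk "$1"/* 2>/dev/null | sort -rn | head -20
}
```

Run once per node, that gives a quick per-job breakdown instead of a single "300GB" number.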
Re: Jenkins Slave Workspace Retention
> On Jul 23, 2018, at 3:04 PM, Joan Touzet wrote: > > > This is why we switched to Docker for ASF Jenkins CI. By pre-building our > Docker container images for CI, we take control over the build environment > in a very proactive way, reducing Infra's investment to just keeping the > build nodes up, running, and with sufficient disk space. All of the projects I’ve been involved with have been using Docker-based builds for a few years now. Experience there has shown that, to ease debugging (esp. since the Jenkins machines are so finicky), information from inside the container needs to be available after the container exits. As a result, Apache Yetus (which is used to control the majority of builds for projects like Hadoop and HBase) will specifically mount key directories from the workspace inside the container so that they are readable after the build finishes. Otherwise one spends a significant amount of time doing a lot of head scratching as to why stuff failed on the Jenkins build servers but not locally. It’s also worth pointing out that “just use Docker” only works if one is building on Linux. That isn’t an option on Windows. This is why a ‘one size fits all’ policy for all jobs isn’t really going to work. Performance on the Windows machines is pretty awful (I’m fairly certain it’s IO), so any time savings there is huge. (For comparison, the last time I looked, a Hadoop Linux full build + full analysis took 12 hours; a Windows full build + partial analysis took 19 hours… a 7 hour difference with stuff turned off!) > It also means that, once a build is done, there is no mess on the Jenkins > build node to clean up - just a regular `docker rm` or `docker rmi` is > sufficient to restore disk space. Infra is already running these aggressively, > since if a build hangs due to an unresponsive docker daemon or network > failure, our post-run script to clean up after ourselves may never run. 
Apache Yetus pretty much manages the docker repos for the ‘Hadoop’ queue machines since it runs so frequently. It happily deletes stale images after a time as well as killing any stuck containers that are still running after a shorter period of time. This way ‘docker build’ commands can benefit from cache re-use but still get forced to do full rebuilds after a time. I enabled the docker-cleanup functionality as part of the precommit-admin job in January as well, so it’s been working alongside whatever extra docker bits the INFRA team has been using on the non-Hadoop nodes. > We don't put everything into saved artefacts either, but we have built a > simple Apache CouchDB-based database to which we upload any artefacts we > want to save for development purposes only. … and where does this DB run? Also, it’s not so much about the finished artifacts as much as it is about the state of the workspace post-build. If no jars get built, then we want to know what happened. > We had this issue too - which is why we build under a `/tmp` directory > inside the Docker container to avoid one build trashing another build's > workspace directory via the multi-node sync mechanism. Apache Yetus based builds mount a dir inside the container. It’s relatively expensive to rebuild the repo for large projects. For Hadoop, this takes in the 5-10 minute area. That may not seem like a lot. But given the number of build jobs per day, that adds up very quickly. The quicker the big jobs run, the more cycles available for everyone and the faster contributors get feedback on their patches. [Ofc,
Re: Jenkins Slave Workspace Retention
> On Jul 23, 2018, at 10:33 AM, Jason Kuster > wrote: > > +1, also occasionally there are network flakes downloading dependencies and > when we were using Maven we were unable to find a way to get it to retry > dependency downloads so this would routinely fail the build. Great point. I completely forgot about how often the network falls out from underneath the build hosts.
Re: Jenkins Slave Workspace Retention
> On Jul 23, 2018, at 12:45 AM, Gavin McDonald wrote: > > Is there any reason at all to keep the 'workspace' dirs of builds on the > jenkins slaves ? Yes. - Some jobs download and build external dependencies, using the workspace directories as a cache and to avoid sending more work to INFRA. Removing the cache may greatly increase build time and network bandwidth, and potentially increase INFRA’s workload. - This will GREATLY increase pressure on the source repositories, as every job will now do a full git clone/svn checkout. Hadoop’s repo size just passed 700M. - Many jobs don’t put everything into the saved artifacts due to size constraints. Removing the workspace will almost certainly guarantee that artifact usage goes way up, as the need to grab (or cache) bits from the workspace will be impossible with an overly aggressive workspace deletion policy. Given how slow IO is on the Windows build hosts, this list is especially critical on them. > And , in advance, I'd like to state that projects creating their own > storage area for jars and other artifacts to quicken up their builds is not a > valid reason. Maven, ant, etc. don’t perform directory locks on local repositories. Separate storage areas for jars are key so that multiple executors don’t step all over each other. This was a HUGE problem for a lot of jobs when multiple executors were introduced a few years ago.
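The usual workaround for the locking problem is sketched below (variable names are mine): point each job at its own local repository instead of a shared ~/.m2.

```shell
# Give each executor/job its own Maven local repo: Maven takes no lock
# on the local repository, so concurrent builds sharing ~/.m2 can
# corrupt each other's cached artifacts.
REPO="${WORKSPACE:-$PWD}/.m2-repo"
mkdir -p "$REPO"
# Hypothetical invocation -- every mvn call in the job gets the flag:
#   mvn -Dmaven.repo.local="$REPO" clean install
echo "$REPO"
```

The cost is exactly the trade-off being debated here: per-job repos are safe for concurrency but eat workspace disk, which is why wiping them forces a full re-download on the next run.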
Re: Pb releasing apache directory LDAP API
> On Apr 30, 2018, at 9:02 AM, Emmanuel Lécharny wrote: > > Missing Signature: > '/org/apache/directory/api/api-asn1-api/1.0.1/api-asn1-api-1.0.1.jar.asc' > does not exist for 'api-asn1-api-1.0.1.jar'. > ... > > > The .md5 and .sha1 signatures are present though. .asc files are PGP signature files. > What could have gone wrong ? Are the directions missing -Papache-release or gpg:sign?
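The .asc files are the detached PGP signatures the staging check is complaining about; a quick way to spot artifacts that lack them (a sketch, helper name and paths mine):

```shell
# Flag any jar in a staging directory that lacks its detached PGP
# signature (.asc); .md5/.sha1 checksums alone don't satisfy the check.
check_asc() {
  find "$1" -name '*.jar' | while read -r jar; do
    [ -f "$jar.asc" ] || echo "missing signature: $jar"
  done
}
# Signatures are normally produced by the maven-gpg-plugin, e.g. via
# the ASF parent's release profile (assumed setup):
#   mvn clean deploy -Papache-release
```

If the loop prints nothing, every jar under the directory has a signature next to it.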
Re: purging of old job artifacts
> On Apr 25, 2018, at 12:04 AM, Chris Lambertus wrote: > > The artifacts do not need to be kept in perpetuity. When every project does > this, there are significant costs in both disk space and performance. Our > policy has been 30 days or 10 jobs retention. That policy wasn’t always in place. > Please dispense with the passive aggressive “unwilling to provide” nonsense. > This is inflammatory and anti-Infra for no valid reason. This process is > meant to be a pragmatic approach to cleaning up and improving a service used > by a large number of projects. The fact that I didn’t have time to post the > job list in the 4 hours since my last reply does not need to be construed as > reticence on Infra’s part to provide it. I apologize. I took Greg’s reply as Infra’s official “Go Pound Sand” response to what I felt was a reasonable request for more information. > Using the yetus jobs as a reference, yetus-java builds 480 and 481 are nearly > a year old, but only contain a few kilobytes of data. While removing them > saves no space, they also provide no value, … to infra. The value to the communities that any job services is really up to those communities to decide. Thank you for providing the data. Now the projects can determine what they need to save and perhaps change process/procedures before infra wipes it out.
Re: purging of old job artifacts
> On Apr 24, 2018, at 4:27 PM, Chris Lambertus wrote: > > The initial artifact list is over 3 million lines long and 590MB. Yikes. OK. How big is the list of jobs? [IIRC, that should be the second part of the file path. e.g., test-ulimit] That’d give us some sort of scope, who is actually impacted, and hopefully allow everyone to clean up their stuff. :) Thanks
Re: purging of old job artifacts
> On Apr 24, 2018, at 4:13 PM, Chris Lambertus wrote: > > If anyone has concerns over this course of action, please reply here. Could we get a list? Thanks!
Re: Building native (C/C++) code
On Mon, Mar 19, 2018 at 1:16 AM, Jan Lahoda wrote: Hi, In Apache NetBeans (incubating), we have some smallish native (C/C++) components. I'd like to ask: are there any official build servers that can build C/C++ code for Windows? (Does not need to be Windows, IMO, could even be cross-compilation.) And if yes, are there existing projects that are using them, so that we could look how to do things properly? FWIW, there are 4 Windows slaves on the Jenkins instance at builds.apache.org if you're more familiar with that type of environment. All four have working versions of VS2015 Pro, as we use them for testing Apache Hadoop.
Switched PreCommit-Admin over to Apache Yetus
bcc: builds@apache.org, d...@hbase.apache.org, d...@hive.apache.org, common-...@hadoop.apache.org, d...@phoenix.apache.org, d...@oozie.apache.org (These are all of the projects that had pending issues. This is not all of the groups that actually rely upon this code...) The recent JIRA upgrade broke PreCommit-Admin. This breakage was obviously an unintended consequence. This python code hasn’t been touched in a very long time. In fact, it was still coming from the Hadoop SVN tree as it had never been migrated to git. So it isn’t too surprising that after all this time something finally broke it. Luckily, Apache Yetus was already in the process of adopting it for a variety of reasons that I won't go into here. With the breakage, this work naturally became more urgent. With the help of the Apache Yetus community doing some quick reviews, I just switched PreCommit-Admin over to using the master version of the equivalent code in the Yetus source tree. As soon as the community can get a 0.7.0 release out, we’ll switch it over from master to 0.7.0 so that it can follow our regular release cadence. This also means that JIRA issues can be filed against Yetus for bugs seen in the code base or for feature requests. [Hopefully with code and docs attached. :) ] In any case, with the re-activation of this job, all unprocessed jobs just kicked off. So don't be too surprised by the influx of feedback. As a side note, there are some other sticky issues with regards to precommit setups on Jenkins. I'll be sending another note in the future on that though. I've had enough excitement for today. :) Thanks!
Re: [JENKINS] - Main instance and plugin upgrades this weekend
> On Dec 20, 2017, at 2:18 PM, Gavin McDonald wrote: > > Hi All, > > Jenkins will be going down for a few hours this coming Saturday/Sunday for a > main instance upgrade to the latest LTS release. > > As part of that, all compatible plugins will be upgraded before and/or after > as appropriate. Is it possible to fix INFRA-15685 while the services will be down? Thanks.
Re: New Windows Jenkins Node
> On Nov 30, 2017, at 12:54 PM, Chris Thistlethwaite wrote: > > Greetings, > > Good news everyone! We've been working on puppetizing Windows Jenkins > nodes and have a new build VM that could use some testing and burn-in. > I'm looking for volunteers to point their build to jenkins-win2016-1 to > iron out any issues. Sure, I’ll set the daily Hadoop Windows build (hadoop-trunk-win) to run on it to help destroy... err, I mean burn it in. ;)
Re: Building with docker - Best practices
> On Nov 14, 2017, at 9:17 AM, Thomas Bouron wrote: > > 1. In Jenkins, rather than using a maven type job, I'm using a freestyle type > job to call `docker run ` during the build phase. Is it the right way to > go? All of the projects I’m involved with only ever use freestyle jobs. S… :) > 2. My docker images are based on `maven:alpine` with few extra bits and bobs > on top. All is working fine but, how do I configure jenkins to push built > artifacts (SNAPSHOT) on Apache maven repo? I'm sure other projects do that > but couldn't figure it out so far. This is one of the few jobs that Apache Hadoop doesn’t have dockerized. I think I know what needs to happen (import the global maven settings) but I just haven’t gotten around to building the bits around it yet. I’ll probably write something up and add it to the Apache Yetus toolbox. > 3. Each git submodule requiring a custom docker image will have their own > `Dockerfile` at the root. I was planning to create an extra jenkins job to > build and publish those images to docker hub. Does Apache has an official > account and if yes, should we use that? Otherwise, I'll create an account for > our project only and share the credential with our PMCs. I’m personally not a fan of depending upon docker hub for images. I’d rather build the images as part of the QA pipeline to verify they always work, and if the versions of bits aren’t pinned, to test against the latest. This also allows the Dockerfile to get precommit testing. It’s worth mentioning that all of the projects I’m involved with use Yetus to automate a lot of this stuff. Patch testing uses the same base images as full builds. So if your tests run frequently enough, they’ll stay cached and the build time becomes negligible over the course of the week. As I work on getting the Yetus Jenkins plug-in written, this will hopefully be dirt simple for everyone to do without spending any time really learning Yetus.
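For what it's worth, a rough sketch of what "import the global maven settings" might look like when deploying from inside a container. The image name, mount paths, and goals here are all assumptions for illustration, not Hadoop's actual job configuration; the command is printed rather than executed so the shape is easy to inspect:

```shell
# Hypothetical sketch: deploy SNAPSHOT artifacts from inside a Maven
# container by mounting the Jenkins node's settings.xml (which holds the
# repository credentials) read-only into the container.
WORKSPACE="${WORKSPACE:-$PWD}"
deploy_cmd="docker run --rm \
  -v ${WORKSPACE}:/src \
  -v ${HOME}/.m2/settings.xml:/root/.m2/settings.xml:ro \
  -w /src \
  maven:alpine \
  mvn clean deploy -DskipTests"
echo "${deploy_cmd}"
```

The key part is the read-only settings.xml mount: the credentials stay on the node and never get baked into the image.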
Jenkins needs to get kicked again.
It’s no longer scheduling jobs. Any clues as to why this happens?
Re: Jenkins slave able to build BED & RPM
> On Oct 30, 2017, at 4:33 AM, Dominik Psenner wrote: > > > On 2017-10-30 11:57, Thomas Bouron wrote: >> Thanks for the reply and links. Went already to [1] but it wasn't clear to >> me what distro each node was (unless going through every one of them but... >> there are a lot) As you said, it seems there isn't a centos or Red Hat >> slave, I'll file a request to INFRA for this then. > > You also have the option to run the build with docker on ubuntu using a > centos docker image. I think it would be wise to evaluate that option before > filing a request to INFRA. The great benefit is that you can build an rpm and > test a built rpm on all the rhel flavored docker images that you would like > to support without the requirement to add additional operating systems or > hardware to the zoo of build slaves. +1 Despite the issues[*], I’m looking forward to a day when INFRA brings the hammer down and requires everyone to use Docker on the Linux machines. I’ve spent the past week looking at why the Jenkins bits have become so unstable on the ‘Hadoop’ nodes. One thing that is obvious is that the jobs running in containers are way easier to manage from the outside. They don’t leave processes hanging about and provide enough hooks to make sure jobs are getting a ‘fair share’ of the node’s resources. Bad actor? Kill the entire container. Bam, gone. That’s before even removing the need to ask for software to be installed. [No need for 900 different versions of Java installed if everyone manages their own…] * - mainly, disk space management and docker-compose creating a complete mess of things.
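To make the suggestion concrete, here's a hedged sketch of building an rpm inside a CentOS container. The image tag, package name, and spec file are illustrative assumptions only, and again the command is printed rather than run:

```shell
# Hypothetical sketch: build an rpm inside a CentOS container instead of
# requesting a dedicated CentOS build node from INFRA.
build_rpm_cmd="docker run --rm \
  -v $PWD:/src -w /src \
  centos:7 \
  bash -c 'yum install -y rpm-build && rpmbuild -bb myproject.spec'"
echo "${build_rpm_cmd}"
```

Swap the image tag (centos:7, centos:6, a fedora tag, etc.) to test the same rpm across every rhel-flavored distro you care about, all on the existing ubuntu nodes.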
builds.apache.org broken again?
Is it just me or has Jenkins stopped scheduling jobs again? (We also have a bunch of broken nodes in the Hadoop label.) Thanks.
Re: Problems with Jenkins UI.
> On Oct 19, 2017, at 7:35 PM, Alex Harui wrote: > > Hi, > > I'm finding that the UI for Jenkins is horribly slow today. I think it’s been slow for a while. There are good days and there are bad days (like today, when it is bordering on unusable). I’ve been doing traceroutes from home to builds.apache.org on a fairly regular basis. The biggest constant is a massive time delay inside telia.net. [I guess at some point the host moved to Europe?] e.g.:

$ traceroute builds.apache.org
traceroute to builds.apache.org (62.210.60.235), 64 hops max, 52 byte packets
 1  turris (192.168.100.1)  0.901 ms  0.894 ms  0.671 ms
 2  192.168.1.254 (192.168.1.254)  1.334 ms  2.572 ms  1.171 ms
 3  108-193-0-1.lightspeed.sntcca.sbcglobal.net (108.193.0.1)  22.929 ms  20.937 ms  36.287 ms
 4  71.148.135.198 (71.148.135.198)  19.569 ms  18.899 ms  19.737 ms
 5  71.145.0.244 (71.145.0.244)  19.547 ms  19.309 ms  20.974 ms
 6  12.83.39.145 (12.83.39.145)  22.587 ms 12.83.39.137 (12.83.39.137)  19.878 ms  21.139 ms
 7  gar23.sffca.ip.att.net (12.122.114.5)  23.315 ms  21.888 ms  23.886 ms
 8  192.205.32.222 (192.205.32.222)  21.677 ms  23.198 ms  21.953 ms
 9  nyk-bb4-link.telia.net (62.115.119.228)  98.108 ms * ash-bb3-link.telia.net (80.91.252.221)  88.178 ms
10  prs-bb4-link.telia.net (62.115.135.117)  164.764 ms prs-bb4-link.telia.net (80.91.251.101)  164.284 ms prs-bb3-link.telia.net (213.155.135.4)  184.142 ms
11  prs-b8-link.telia.net (62.115.118.73)  183.718 ms prs-b8-link.telia.net (62.115.136.177)  171.897 ms  170.859 ms
12  online-ic-315748-prs-b8.c.telia.net (62.115.63.94)  174.738 ms  171.917 ms  164.820 ms
13  195.154.1.229 (195.154.1.229)  162.056 ms  177.235 ms  174.669 ms
14  62.210.60.235 (62.210.60.235)  161.606 ms  168.438 ms  173.439 ms
Re: Proactive Jenkins slaves monitoring?
> On Oct 12, 2017, at 8:34 AM, Robert Munteanu wrote: > Jenkins slaves running out of disk space has been an issue for quite > some time. Not a major deal-breaker or very frequent, but it's still > annoying to chase issues, reconfigure slave labels, retrigger builds, > etc From what I’ve seen, the biggest issues are caused by broken docker jobs. I don’t think people realize that when their docker jobs fail, the disk space and container aren’t released. (Docker only automatically cleans up on *success*!) Apache Yetus has tools to deal with old docker bits on the system. As a result, on the ‘hadoop’ labeled machines (which have multiple projects using Yetus precommit in sentinel mode), I don’t think I’ve seen an out-of-space failure on those nodes in a very long time. Apache Yetus itself is configured to run on quite a few nodes. When the (rare) patch comes through that runs on a node that isn’t typically running Yetus, it isn’t unusual to see months’ worth of images eating space and containers still running. It will then wipe out a bunch of the excess. I should probably add df (and cpu time?) output to see how much it is reclaiming. In some cases I’ve seen, it’s easily in the high GB area.
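A hedged sketch of the kind of cleanup involved — this is not the actual Yetus code, just an illustration of the docker prune plumbing. The one-week age cutoff is an arbitrary example, and DRY_RUN=1 (the default here) only prints what would be removed:

```shell
#!/usr/bin/env bash
# Illustrative docker cleanup, not the Yetus implementation.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "${DRY_RUN}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

docker_cleanup() {
  # Containers left behind by failed jobs: anything stopped gets removed.
  run docker container prune --force
  # Dangling layers from interrupted 'docker build' runs.
  run docker image prune --force
  # Images untouched for a week (the cutoff is an example, tune per node).
  run docker image prune --all --force --filter "until=168h"
}

docker_cleanup
```

Scheduling something like this (with DRY_RUN=0) from cron or a periodic Jenkins job is exactly the sort of sentinel-mode behavior described above.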
Re: H18 full
> On Sep 22, 2017, at 6:35 AM, Allen Wittenauer <a...@effectivemachines.com> > wrote: > > >> On Sep 22, 2017, at 6:24 AM, Daniel Pono Takamori <p...@apache.org> wrote: >> >> Allen, do you have a link to a job that failed in the way you >> describe? I booped the docker service to be safe, so hopefully it was >> temporal. > > > Most of Hadoop’s jobs that were running on H10 use docker build. I’ll > reconfigure precommit-hadoop-build to only run on H10 and fire off a job or > three. > > Thanks! The two jobs I fired off got past where they were hanging so it looks like H10 is good again. Thanks!
Re: Builds with maven 3.x on Jenkins
> On Sep 15, 2017, at 2:36 PM, Oleg Kalnichevski wrote: > ERROR: Maven Home \home\jenkins\tools\maven\apache-maven-3.0.5 doesn’t exist That’s the Windows box. > Is there anything I could be doing wrong? Did you mean to run on the Windows box? ;D
Re: Precommit-Admin no longer running on Jenkins
> On Aug 4, 2017, at 7:16 AM, Allen Wittenauer <a...@effectivemachines.com> > wrote: > > >> On Aug 4, 2017, at 7:13 AM, Allen Wittenauer <a...@effectivemachines.com> >> wrote: >> There’s definitely something wrong with scheduling on Jenkins but I’m >> clueless as to what it is. > > > Add yetus-qbt to the list of jobs not getting scheduled. OK, it’s definitely > not just Hadoop then. For those curious, scheduled jobs that should be running still aren’t. I’ve filed INFRA-14798 as a result.
Re: Precommit-Admin no longer running on Jenkins
> On Aug 4, 2017, at 7:13 AM, Allen Wittenauer <a...@effectivemachines.com> > wrote: > There’s definitely something wrong with scheduling on Jenkins but I’m > clueless as to what it is. Add yetus-qbt to the list of jobs not getting scheduled. OK, it’s definitely not just Hadoop then.
Re: Precommit-Admin no longer running on Jenkins
> On Aug 4, 2017, at 7:07 AM, Sean Busbey wrote: > > Oh wait, you meant H for hashing in the crontab. lol. > > /facepalm > > I'll go change that too. :) :D It’s early Friday morning. You’re allowed a few facepalms. FWIW: I (manually) kicked off Hadoop’s big jobs. Just like this one, they fired right up with no problems. There’s definitely something wrong with scheduling on Jenkins but I’m clueless as to what it is. I added a Poll SCM w/a H timer to hadoop-trunk-win to see if it fires up tonight. I noticed that a few other jobs that appear to be getting scheduled have that.
Re: Precommit-Admin no longer running on Jenkins
> On Aug 3, 2017, at 3:36 PM, Gavin McDonald wrote: > > Note that just means the Hadoop nodes, seeing as there is no ‘Ubuntu’ label > any more, its ‘ubuntu’ Oh, actually, I typo’d that. It’s using lowercase ubuntu. :) But it’s definitely more than just precommit-admin that is not getting scheduled. hadoop-trunk-win and hadoop-qbt-trunk-java8-linux-x86 didn’t run, amongst others. It’s starting to look like any job that uses H isn’t getting scheduled.
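For anyone unfamiliar with the H syntax being discussed: it's Jenkins' extension to cron that hashes the job name into a consistent but spread-out slot, so every job doesn't fire at exactly minute 0. A couple of illustrative schedule lines:

```
H 4 * * *     # once a day, at a hashed minute within the 04:00 hour
H H * * *     # once a day, hour and minute both hashed
H/15 * * * *  # roughly every 15 minutes, offset by the hash
```

That's what makes an H-specific scheduling failure plausible: these entries go through Jenkins' own trigger machinery rather than a literal crontab.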
Re: Precommit-Admin no longer running on Jenkins
> On Aug 3, 2017, at 12:21 PM, Sean Busbey wrote: > > What are the associated node labels for it running? That's the most > common cause of no runs I know of. It’s set for Hadoop||Ubuntu, so there are definitely nodes. I just disabled Poll SCM and left the timer. I figure it’s worth a shot. Maybe the config update will do something if nothing else.
Timezone change?
Did the timezone on the box change or something else regarding time? I’ve noticed that one of our scheduled builds is now starting ~8-9 hours later. No big deal (other than it eating a slot during prime time), but just wondering if we should adjust all scheduled jobs. Thanks.
Re: Tons of nodes offline/broken
> On Jun 23, 2017, at 10:03 AM, Daniel Pono Takamori wrote: > > Looks like this is relatively adaptable to deploying for other builds, > if not generalizable to use in a cron. Thanks a bunch! You're welcome. For those that have never seen Apache Yetus activate its cleanup mode, here's a sample: https://builds.apache.org/job/PreCommit-YETUS-Build/589/consoleFull
Re: Tons of nodes offline/broken
> On Jun 23, 2017, at 7:11 AM, Allen Wittenauer <a...@effectivemachines.com> > wrote: > > >> On Jun 22, 2017, at 11:12 PM, Daniel Pono Takamori <p...@apache.org> wrote: >> >> Docker filled up /var/ so I cleared out the old images. Going to work >> on making sure docker isn't a disk hog in the future. > > > ha. Kind of ironic that my Yetus job failed... it would have cleaned it > up. I guess we should probably make a way to run Yetus' docker cleanup code > independently. We could then schedule a job to run every X days to do the > cleanup automatically. Filed YETUS-523 with a patch to add a 'docker-cleanup' command that just triggers Yetus' Docker cleanup code.
Re: Tons of nodes offline/broken
> On Jun 22, 2017, at 11:12 PM, Daniel Pono Takamori wrote: > > Docker filled up /var/ so I cleared out the old images. Going to work > on making sure docker isn't a disk hog in the future. ha. Kind of ironic that my Yetus job failed... it would have cleaned it up. I guess we should probably make a way to run Yetus' docker cleanup code independently. We could then schedule a job to run every X days to do the cleanup automatically.
Tons of nodes offline/broken
Hi all. Just noticed that there are a ton of H nodes offline and qnode3 is failing builds because it is out of space. Thanks.