Re: Hadoop Windows Build

2024-05-03 Thread Allen Wittenauer



> On May 3, 2024, at 9:04 AM, Gavin McDonald  wrote:
> 
> Build times are in the order of days, not hours, how is the caching helping
> here?

It won’t help for full builds, but for PRs, where it only builds parts of the tree, the savings can be dramatic.  (Remember: this is running Yetus, which will only rebuild the required modules.)



Re: Hadoop Windows Build

2024-05-03 Thread Allen Wittenauer



> On Apr 26, 2024, at 9:42 AM, Cesar Hernandez  wrote:
> 
> My two cents is to use cleanWs() instead of deleteDir() as
> documented in: https://plugins.jenkins.io/ws-cleanup/


If this were a generic, run-of-the-mill build, that could be an option. We definitely don’t want to do that for Hadoop builds, though. There is a bunch of caching happening to speed things up, and deleting those caches would be _very_ detrimental to build times.





Re: ASF Maven parent pom and use properties to define the versions of plugins

2023-06-07 Thread Allen Wittenauer



> On Jun 7, 2023, at 9:36 PM, Christopher  wrote:
> 
> I think your concern about the risks of updating plugin versions is
> valid, but I don't think it has anything to do with how those plugin
> versions are expressed in the parent POM. If anything, using
> properties to express these versions would make it easier for you to
> update the parent POM, but hold back specific plugins when those
> versions cause problems for you. You could also continue doing what
> you're doing now and not update the parent POM. That's perfectly
> valid. I just wonder, if you're going to do that, why care about how
> versions are expressed as properties in newer versions of the parent
> POM, enough to offer a -1 at the idea, if you're not even interested
> in using those newer versions of the parent POM?

I was under the impression that a bunch of _new_ entries were suddenly going to be added with this change.  I’m a big fan of less-is-more in my build tools.

Re: ASF Maven parent pom and use properties to define the versions of plugins

2023-06-07 Thread Allen Wittenauer



> On Jun 7, 2023, at 11:46 AM, Karl Heinz Marbaise  wrote:
> 
> Hi,
> On 07.06.23 19:23, Allen Wittenauer wrote:
>> 
>> 
>>> On Jun 5, 2023, at 3:28 PM, Slawomir Jaranowski  
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> I want to introduce properties to define versions of plugins.
>>> I have prepared PR [1] and we have a discussion about properties schema for
>>> such purposes.
>>> 
> Because ASF Maven parent is used by other ASF projects, and such changes
>>> can be difficult to change in the next release, I want to know other
>>> opinions.
>> 
>> -1
>> 
>> Some projects are stuck on old versions of the pom because newer ones 
>> introduce plugins with bugs.  e.g., MASSEMBLY-942 stopped some projects on 
>> v21 for a very long time.
> 
> The issue is related to a non-Apache API (build-api, related to Eclipse)
> that has been abandoned (12+ years old) ...
> And why does an Eclipse-related bug stop you from using that in builds?
> 
> Which plugins are we talking exactly? Which kind of bugs have occurred?

Whoops, I meant MASSEMBLY-941, which left a trail of dead in its wake, all linked to in the ticket.

I know I hit a bug with the latest Apache pom where something (I’m guessing assembly again) tries to resolve relative symlinks and make them absolute, which in turn blows up the build. I don’t have time to track it down, so I’ll likely just stick with an ancient version of the Apache pom. I just don’t have time to debug this stuff. Even though we only release this project maybe twice a year, every year it is “can we update the Apache pom? nope,” so at this point I’ll likely just stop even attempting it.




Re: ASF Maven parent pom and use properties to define the versions of plugins

2023-06-07 Thread Allen Wittenauer



> On Jun 5, 2023, at 3:28 PM, Slawomir Jaranowski  
> wrote:
> 
> Hi,
> 
> I want to introduce properties to define versions of plugins.
> I have prepared PR [1] and we have a discussion about properties schema for
> such purposes.
> 
> Because ASF Maven parent is used by other ASF projects, and such changes
> can be difficult to change in the next release, I want to know other
> opinions.

-1

Some projects are stuck on old versions of the pom because newer ones introduce 
plugins with bugs.  e.g., MASSEMBLY-942 stopped some projects on v21 for a very 
long time.

So no, the parent pom needs to define less, not more.

[I’m almost to the point of just forking the thing and removing bits because it 
is so wildly unreliable.]



Re: Broken Builds because of gradle-enterprise

2022-12-09 Thread Allen Wittenauer



> On Dec 9, 2022, at 7:43 AM, Greg Stein  wrote:
> 
> We make changes to the Jenkins environment all the time, to keep it
> operational, on the latest software, to provide enough CPU and disk space
> for builds, add requested plugins, and more. We do not advise projects
> before we make changes because we expect no problems to arise. This fell
> into that same kind of "routine change", or so we thought.


From 
https://plugins.jenkins.io/gradle/#plugin-content-gradle-enterprise-integration:

Note - The configuration applies to all builds on all connected agents 
matching the specified label criteria, or all in case no label criteria are 
defined.

So projects that want to use the feature need custom labels, and their builds need to move to those labels.

Re: Multi-arch container images on DockerHub

2022-12-06 Thread Allen Wittenauer


> On Dec 6, 2022, at 4:43 AM, Robert Munteanu  wrote:
> 
> I see two alternatives so far:
> 
> 1. Moving to GitHub actions

Apache Yetus did the move from docker hub builds to Github Actions 
because ...

> 2. Use hooks to install qemu and 'fake' a multi-arch build on Docker
> Hub


... when I tried to do this a bit over a year ago, the kernel on the 
docker hub machines didn't support qemu. 
https://github.com/docker/roadmap/issues/109 seems to still be open so that 
functionality is likely still missing.  That said, the project kept the hook in 
place in case it is ever supported.  So...


> How are other projects handling this? Or does anyone have any ideas
> that they can share?

... the build is pretty much contained in two files:

https://github.com/apache/yetus/blob/main/.github/workflows/ghcr.yml 
https://github.com/apache/yetus/blob/main/hooks/build

The build file does a lot of extra work that may or may not be desired (such as building cascading container images), but it should at least work on Linux and macOS.  Being able to run it locally after a bit of multi-arch setup is a _huge_ debugging win vs. going all-in on a native GH Action method.
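Roughly, that local multi-arch setup looks something like the sketch below, assuming Docker's buildx and the qemu binfmt helper image; the builder name and image tag are placeholders, not the real Yetus ones:

```
# register qemu handlers so non-native images can run under emulation
docker run --privileged --rm tonistiigi/binfmt --install all

# create and select a buildx builder capable of multi-platform builds
docker buildx create --name multiarch --use
docker buildx inspect --bootstrap

# build (and push) an amd64+arm64 image in one shot; the image name is illustrative
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag registry.example.com/yetus-test:latest \
  --push .
```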

The hooks/build file is also still run on Docker Hub so as to not break users who are still pulling from there.  At some point, we'll have a discussion in the project about getting rid of it, but for now everything is relatively consistent between the two container repos.  The only difference is that the Docker Hub repo is built for x86 only, while GHCR has both arm and x86 as a multi-arch image.

Some unsolicited advice: keep in mind the bus factor.  A lot of projects have wildly complex build systems that are maintained by one, maybe two, people. While those build processes may be faster or better or more functional or whatever... at some point other people will need to understand them.



Re: Meeting this Thursday

2022-10-17 Thread Allen Wittenauer



> On Oct 17, 2022, at 6:11 AM, Daan Hoogland  wrote:
> 
> - can a jenkins job be restarted for the apache jenkins server for the
> exact same (merged) code? It seems the analysis job doesn't start correctly

Broadly, yes.  The caveat here is that Jenkins is branch aware and tends to use the last commit on a branch as the place where restarts happen.  This means that if you need a _specific_ commit on a branch that has had additional commits, you’ll need to mark that location with something Jenkins sees (a branch, a PR, or a tag if configured). There are some tricks to work around that “limitation.”  But in practice it generally isn’t one, since the vast majority of the time people seem to want to always test the latest commit of their branch/PR/tag.

> - can the apache jenkins job be started from a PR if it wasn't added for
> some reason.

The same trick mentioned above applies: change your job to take the repo/commit info and forcibly check that out. But if the PR isn’t showing up at all, something else went wrong, especially if a Scan Repository isn’t pulling in that PR while others show up fine.  (I’m assuming GitHub Multibranch is being used… if something else is being used, Step 1: switch to GHMB!)




Re: Building with Travis - anyone?

2022-09-01 Thread Allen Wittenauer



> On Sep 1, 2022, at 2:52 AM, P. Ottlinger  wrote:
> over the years Travis seems to have degraded: builds fail regularly due to 
> technical issues or the inability to download artifacts from Maven central 
> that are mirrored onto Travis resources.
> 
> Does any other project have stable Travis builds?
> 
...

> 
> GithubAction builds are green, so is ASF Jenkins and local builds 
> 
> Thanks for any opinions, links or hints 
> 

Apache Yetus still has Travis builds and Github Actions running from 
the project's repo. I also run the other Apache Yetus-supported CIs from my 
personal account regularly so that the project doesn't expose the rest of the 
ASF projects to them. (See 'Automation' at 
https://yetus.apache.org/documentation/in-progress/precommit/#optional-plug-ins 
for the current list.) The different CIs run nearly the same pipeline.

Anecdotally, Travis is generally the worst performing and most unreliable of all of them, on some days by a fairly large factor, to the point that I've thought about raising a PR to remove support. So no, it isn't just Creadur.

Travis has also been wildly unpredictable with changes (e.g., limits on log sizes just got introduced in the past year or so, and I seem to recall a lower memory limit being added). If these failures are recent, a new limit might be what is triggering them.  But honestly: unless the project _really_ needs Travis, I'd recommend migrating off of it.  While it sits somewhere in between Jenkins and GitHub Actions on the complexity scale, one is probably better off either dumbing down the build for GHA or going full bore into Jenkins for the heavier needs.  (Full disclosure: I haven't kept up with the ASF Jenkins config since I run my own instance for Yetus testing, but I'm assuming it is still more stable than Travis given there has been little squawking on builds@ lately. Haha.)

Re: problem using maven to gpg-sign to and upload release artifacts to the nexus repository

2022-06-09 Thread Allen Wittenauer



> On Jun 9, 2022, at 7:51 AM, Rick Hillegas  wrote:
> 
> Thanks for the quick response, Maxim.
> 
> Yes, my credentials are in ~/.m2/settings.xml. Maven is able to upload 
> artifacts and checksums, so the credentials are good. It's the gpg-signing 
> bit that's broken.
> 
> No, I can't ssh to repository.apache.org:
> 
>   mainline (17) > ssh rhille...@repository.apache.org
>   rhille...@repository.apache.org: Permission denied (publickey)


Last I checked, the maven-gpg-plugin (which is what gets called under the hood, IIRC) needs, on some versions, to be told which type of gpg you are using.  So you might need to configure it for your particular build environment.  For Apache Yetus, we set up a profile to do that:

https://github.com/apache/yetus/blob/fae6b390b06c0f1752fc15221cea4b9cdb7d44dc/pom.xml#L264
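If you'd rather not touch the pom first, here's a command-line sketch of the same idea, assuming the plugin's standard gpg.executable / gpg.keyname user properties and the Apache parent's apache-release profile; the key id is a placeholder and your project may differ:

```
# tell maven-gpg-plugin which gpg binary to use and which key to sign with
mvn clean verify -Papache-release \
  -Dgpg.executable=gpg \
  -Dgpg.keyname=0xDEADBEEF
```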




Jenkins patch file processing via JIRA

2022-05-04 Thread Allen Wittenauer



Just a quick check:

Is anyone still using Jenkins to test patches attached to JIRA issues 
(usually via the Precommit-admin job)?  If we were to kill it from Apache 
Yetus, would that cause anyone heartache? (I don’t have access to the filter 
that is being used so no idea who is even signed up to use it.)

Thanks.



Re: ephemeral builds via AWS ECS and/or EKS? GPU Nodes?

2021-12-31 Thread Allen Wittenauer



> On Dec 30, 2021, at 10:58 AM, Chris Lambertus  wrote:
> 
> Hi folks,
> 
> We have some funding to explore providing ephemeral builds via ECS or EKS in 
> the Amazon ecosystem, but Infra does not have expertise in this area. We 
> would like to integrate such a service with Jenkins.
> 
> Does anyone have experience with using these services for CI, and would you 
> be interested in assisting Infra in developing a prototype?
> 
> Additionally, we may be able to provide some build nodes with GPUs. Do we 
> have projects which could/would make use of GPUs for integration testing?


At $DAYJOB, I configured the Amazon EC2 plug-in ( https://plugins.jenkins.io/ec2 ) to do this type of thing using spot instances with labels tied to the particular EC2 node type that our jobs use.  I avoided using the EC2 Fleet plug-in ( https://plugins.jenkins.io/ec2-fleet ) mainly because it always seemed to keep at least one node running, which is not really what you want if you are trying to get the most bang for your buck. In other words, startup time is less important to me than having a node sit idle all weekend.

Biggest issues we’ve hit with this setup are:  

a) Depending upon your spot price, you may get outbid and the node gets killed 
out from underneath you (rarely happens but it does happen with our bid)

b) You need to know ahead of time what types of nodes you want to allocate and 
then set a label to match. For the ASF, that might be tricky given a lot of 
people have no idea what the actual requirements for their jobs are.

c) During a Jenkins restart, on rare occasions the plug-in will ‘lose track’ of allocated nodes. We have limits on how long our allocations will last based on the number of runs and idle time, so we can generally spot a ‘stuck’ node after a day or so.

I haven’t tried configuring it to use EKS because none of our stuff needs k8s yet.

Re: Github Token Permissions

2021-12-25 Thread Allen Wittenauer



> On Dec 25, 2021, at 3:42 AM, Gavin McDonald  wrote:
> 
> Hi
> 
> On Sat, Dec 25, 2021 at 12:24 PM Gavin McDonald  wrote:
> I'll take a look, note that Infra has not changed anything, so we can rule 
> that out as a possible cause.
> 
> I see the last two builds failed the test-patch step, but doesnt say why. 
> Can you let me know how you narrowed the failure down to the built in 
> GITHUB_TOKEN ?

If you look at the raw logs, you’ll see Yetus trying to write a github 
status and throwing that error. If it could write, it would tell you why the 
job failed.  Looking at a working vs. not working job setup, it is clear the 
token permissions have changed from write to read.

At this point, I’m just going to assume that we’ll need to code around 
this change. :(. Not sure how we’ll do that, but…






Re: Github Token Permissions

2021-12-24 Thread Allen Wittenauer


The one that actually uses Apache Yetus to test Apache Yetus:

https://github.com/apache/yetus/blob/main/.github/workflows/yetus.yml

"ERROR: Failed to write github status. Token expired or missing repo:status 
write?"

It was working fine a bit over 2 weeks ago and now it isn’t. I forgot that the 
’Set up job’ section actually shows the permissions of the token.  Comparing 
working vs. not-working, it is pretty obvious something has changed. (Given 
what Apache Yetus does, this functionality is _very_ critical…)


> On Dec 24, 2021, at 12:29 AM, Gavin McDonald  wrote:
> 
> Hi Allen,
> 
> Which workflow please?
> 
> On Fri, Dec 24, 2021 at 2:59 AM Allen Wittenauer  wrote:
> 
>> 
>> 
>> Did something change with ASF github token permissions?  It would appear
>> one of our workflows can no longer write Statuses.  (I haven’t checked if
>> Checks still work or not.)
> 
> 
> 
> -- 
> 
> *Gavin McDonald*
> Systems Administrator
> ASF Infrastructure Team



Github Token Permissions

2021-12-23 Thread Allen Wittenauer



Did something change with ASF github token permissions?  It would appear one of 
our workflows can no longer write Statuses.  (I haven’t checked if Checks still 
work or not.)

Re: Pushing Docker Images

2021-11-18 Thread Allen Wittenauer



> On Nov 18, 2021, at 1:27 PM, Chris Lambertus  wrote:
> 
> x86_64 docker on M1 is going to be running under rosetta2 emulation mode (i 
> didn't even know you could do that,) and would potentially be considerably 
> slower than native x86_64 hardware.. The results would likely be different if 
> you were performing this natively on AARCH64... I'm not sure what you meant 
> about building both amd64 and arm64, are you running an arm64 cross compiler 
> on an amd64 emulated docker image on an M1?
> 

Yup.  Docker’s buildx framework allows you to build multiple 
architecture images in an emulation mode simultaneously to avoid all the 
craziness of using manifests to publish the same tag with different 
architectures attached.  More details here: 
https://docs.docker.com/buildx/working-with-buildx/
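For contrast, here is a sketch of the manual manifest juggling that buildx replaces; the image names are placeholders:

```
# old-school approach: push one tag per architecture...
docker push registry.example.com/img:1.0-amd64
docker push registry.example.com/img:1.0-arm64

# ...then stitch them into a single multi-arch tag by hand
docker manifest create registry.example.com/img:1.0 \
  registry.example.com/img:1.0-amd64 \
  registry.example.com/img:1.0-arm64
docker manifest annotate registry.example.com/img:1.0 \
  registry.example.com/img:1.0-arm64 --arch arm64
docker manifest push registry.example.com/img:1.0
```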

On Linux, it uses qemu (as above).  On Docker Desktop for Mac… I’m honestly not sure what it is doing. I’d like to think it is using Rosetta 2 plus some secret sauce, but it was so slow that I’m actually wondering if it doesn’t just run qemu-x86 in the VM. :/  I need to spend more time playing with it to see what is going on under the hood.

For the version of the Yetus containers sitting in ghcr.io, it was 
built using a single GitHub runner + qemu via GitHub Actions.   You can see 

* the log of the run here: 
https://github.com/apache/yetus/actions/runs/1476885666 (warning: it is big so 
use raw mode)
* the workflow here: 
https://github.com/apache/yetus/blob/main/.github/workflows/ghcr.yml
* the raw docker commands here: 
https://github.com/apache/yetus/blob/main/hooks/build

(Because it is in hooks/build, if Docker Hub ever fixes their stuff, Yetus will 
automatically pick it up.)



Re: Pushing Docker Images

2021-11-18 Thread Allen Wittenauer


PR was merged.  If anyone is curious what a multi-arch repo looks like under GHCR:

https://github.com/apache/yetus/pkgs/container/yetus
https://github.com/apache/yetus/pkgs/container/yetus-base



Thanks.

Re: Pushing Docker Images

2021-11-17 Thread Allen Wittenauer



> On Nov 17, 2021, at 4:17 AM, Martin Grigorov  wrote:
> 
>>- In my trials this morning, building both amd64 and arm64 took
>> ~1h. That’s at least better than my M1 Max MBP which never completed after
>> several hours.
>> 
> 
> Did you just say that x86_64+QEMU was faster than M1 Max ?!
> I wouldn't believe it even if I see it with my eyes! :-)

Haha that really does read like that doesn’t it.  :D

My hunch is that I need to give Docker more memory than my usual 4GB 
and it will complete. I just need to find time to try it out. 

Re: Pushing Docker Images

2021-11-16 Thread Allen Wittenauer



> On Nov 16, 2021, at 2:34 AM, Martin Grigorov  wrote:
> 
> Hi Allen,
> 
> I've just documented how one could use Oracle Cloud free plan to build and
> test on Linux ARM64 for free!
> Please check
> https://martin-grigorov.medium.com/github-actions-arm64-runner-on-oracle-cloud-a77cdf7a325a
> and see whether it could be helpful for your case!
> You will need Apache Infra team's help for the token needed by ./config.sh
> and to setup the security/approvals. Everything else could be done by you
> and any member of your team!
> 
> Feedback is welcome!


Thanks!  I’ll take a look at that.

Ironically, I  just opened a PR for multi-arch docker builds for Apache Yetus 
this morning via qemu on GitHub Actions.  Yetus has the huge benefit of already 
having the bits in place to build on Docker hub so that was easily re-used.

https://github.com/apache/yetus/pull/239

For those curious:
- The Apache Yetus image is _huge_ (lots of tooling…) so takes a while 
to build anyway.
- In my trials this morning, building both amd64 and arm64 took ~1h. 
That’s at least better than my M1 Max MBP which never completed after several 
hours.


The next thing will likely be figuring out how to mirror, or maybe we’ll just kill the apache/yetus images on Docker Hub, depending upon how things work out. 🤷‍♂️

Pushing Docker Images

2021-11-12 Thread Allen Wittenauer



Hi.

For those that build multi-arch, what process are people using to push images to Docker Hub?  We’ve been using the automated builder but it doesn’t appear to support arm64. I’m debating moving the builder … somewhere… and then pushing multi-arch that way.

Thoughts?

Thanks.

Re: GA again unreasonably slow (again)

2021-02-08 Thread Allen Wittenauer



> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk  wrote:
> 
>> I'm not convinced this is true. I have yet to see any of my PRs for
> "non-big" projects getting queued while Spark, Airflow, others are.  Thus
> why I think there are only a handful of projects that are getting upset
> about this but the rest of us are like "meh whatever."
> 
> Do you have any data on that? Or is it just anecdotal evidence?

Totally anecdotal.  Like when I literally ran a Yetus PR during the 
builds meeting as you were complaining about Airflow having an X deep queue. My 
PR ran fine, no pause.

> You can see some analysis and actually even charts here:
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Yes, and I don't even see Yetus showing up.  I wonder how many other projects are getting dropped from the dataset.

> Maybe you have a very tiny "PR traffic" and it is mostly in the time zone
> that is not affected?

True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was different, though.)  But if it were one big FIFO queue, our PR jobs would also get queued.  They aren't, even when I go look at one of the other projects that does have queued jobs.

When you see Airflow backed up, maybe you should try submitting a PR to 
another project yourself to see what happens.

All I'm saying is: right now, that document feels like it is _greatly_ overstating the problem and, now that you point it out, clearly dropping data.  It is a problem, to be sure, but not all GitHub Actions projects are suffering.  
(I wouldn't be surprised if smaller projects are actually fast tracked through 
the build queue in order to avoid a tyranny of the majority/resource starvation 
problem... which would be ironic given how much of an issue that is at the ASF.)

Re: GA again unreasonably slow (again)

2021-02-08 Thread Allen Wittenauer


> On Feb 7, 2021, at 4:44 PM, Jarek Potiuk  wrote:
> 
> If you are interested - my document is here. Open for comments - happy to
> add you as editors if you want (just send me your gmail address in priv).
> It is rather crude, I had no time to put a bit more effort into it due to
> some significant changes in my company, but it should be easy to compare
> the values and see the actual improvements we can get. There are likely a
> few shortcuts there and some of the numbers are "back-of-the-envelope" and
> we are going to validate them even more when we implement all the
> optimisations, but the conclusions should be pretty sound.
> 
> https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#



"For Apache projects, starting December 2020 we are experiencing a high strain 
of GitHub Actions jobs. All Apache projects are sharing 180 jobs and as more 
projects are using GitHub Actions the job queue becomes a serious bottleneck. "

I'm not convinced this is true. I have yet to see any of my PRs for 
"non-big" projects getting queued while Spark, Airflow, others are.  Thus why I 
think there are only a handful of projects that are getting upset about this 
but the rest of us are like "meh whatever."





Re: Failure with Github Actions from outside of the organization (out of a sudden!)

2020-12-28 Thread Allen Wittenauer



> On Dec 27, 2020, at 7:53 AM, kezhenxu94@apache  wrote:
> We (SkyWalking community) are also building some useful tools that may 
> benefit more ASF projects, such as a license audit tool 
> (http://github.com/apache/skywalking-eyes) that I believe most projects will 
> need it

In case you weren't aware, https://creadur.apache.org/rat/ already does 
license auditing.  I think most projects are probably using it at this point. 

Re: GitHub Actions Concurrency Limits for Apache projects

2020-12-21 Thread Allen Wittenauer



> On Dec 20, 2020, at 5:20 PM, Michael A. Smith  wrote:
> 
> The Apache Avro project is looking at switching from a TravisCI/Yetus
> megabuild to GitHub Actions.

If you plan on moving the Yetus portion over to using Yetus' GitHub Action ( https://yetus.apache.org/documentation/0.13.0/precommit/robots/githubactions/ ), it should primarily be a matter of copying/moving the personality file to .yetus/personality.sh (it will get picked up there automatically) and setting up the workflow file.  The rest should "just work."  If it doesn't, let us know!



Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub

2020-11-02 Thread Allen Wittenauer



> On Oct 29, 2020, at 8:37 AM, Allen Wittenauer 
>  wrote:
> 
> 
> 
>> On Oct 28, 2020, at 11:57 PM, Chris Lambertus  wrote:
>> 
>> Infra would LOVE a smarter way to clean the cache. We have to use a heavy 
>> hammer because there are 300+ projects that want a piece of it, and who 
>> don’t clean up.. We are not build engineers, so we rely on the community to 
>> advise us in dealing with the challenges we face. I would be very happy to 
>> work with you on tooling to improve the cleanup if it improves the 
>> experience for all projects.
> 
>   I'll work on YETUS-1063 so that things make more sense.  But in short, 
> Yetus' "docker-cleanup --sentinel" will  purge container images if they are 
> older than a week, then kill stuck containers after 24 hours. That order 
> prevents running jobs from getting into trouble.  But it also means that in 
> some cases it doesn't look very clean until two or three days later.  But 
> that's ok: it is important to remember that an empty cache is a useless 
> cache.  Those values came from experiences with Hadoop and HBase, but we can 
> certainly add some way to tune them.  Oh, and unlike the docker tools, it 
> pretty much ignores labels.  It does _not_ do anything with volumes, probably 
> something we need to add.

Docs updated!

Relevant pages:

- 
http://yetus.apache.org/documentation/in-progress/precommit/docker-cleanup/
- http://yetus.apache.org/documentation/in-progress/precommit/docker/

Let me know if something doesn't make sense.

Thanks!



Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub

2020-10-29 Thread Allen Wittenauer


> On Oct 29, 2020, at 9:21 AM, Joan Touzet  wrote:
> 
> (Sidebar about the script's details)

Sure.

> I tried to read the shell script, but I'm not in the headspace to fully parse 
> it at the moment. If I'm understanding correctly, this will still catch 
> CouchDB's CI docker images if they haven't changed in a week, which happens 
> often enough, negating the cache.

Correct. We actually tried something similar for a while and discovered that in a lot of cases upstream packages would disappear (or worse, have security problems), thus making it look like the image is still "good" when it's not.  So a weekly rebuild at least guarantees some level of "yup, still good" without having too much of a negative impact.

> As a project, we're kind of stuck between a rock and a hard place. We want to 
> force a docker pull on the base CI image if it's out of date or the image is 
> corrupted. Otherwise we want to cache forever, not just for a week. I can 
> probably manage the "do we need to re-pull?" bit with some clever CI 
> scripting (check for the latest image hash locally, validate the local image, 
> pull if either fails) but I don't understand how the script resolves the 
> latter.

Most projects that use Yetus for their actual CI testing build the image used for the CI as part of the CI.  It is a multi-stage, multi-file docker build: each run uses a 'base' Dockerfile (provided by the project) that rarely changes and a per-run file that Yetus generates on the fly, with both images tagged by either git sha or branch (depending upon context). Due to how docker reference-counts the image layers, the docker images effectively work as a "rolling cache", and (beyond a potential weekly cache removal) full builds are rare, thus making them relatively cheap (typically <1m runtime) unless the base image had a change far up the chain (so structure wisely).  Of course, this also tests the actual image of the CI build as part of the CI.  (The "what tests the testers?" philosophy.)  Given that Jenkins tries really hard to have job affinity, re-runs were still cheap after the initial one. [Ofc, now that the cache is getting nuked every day...]
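A stripped-down sketch of that base + per-run pattern is below; the file names, tags, and variables are made up for illustration, and in practice Yetus generates the per-run Dockerfile itself:

```
# 'base' image: provided by the project, rarely changes, tagged by branch
docker build -t ci-base:"${BRANCH:-main}" -f dev-support/docker/Dockerfile .

# per-run image: generated on the fly on top of the base, tagged by git sha
cat > /tmp/Dockerfile.run <<EOF
FROM ci-base:${BRANCH:-main}
RUN useradd -m -u 1000 ciuser
USER ciuser
EOF
docker build -t ci-run:"${GITSHA:-local}" -f /tmp/Dockerfile.run /tmp

# because the per-run image shares layers with the base, a rebuild is cheap
# unless something high up in the base Dockerfile changed
```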

Actually, looking at some of the ci-hadoop jobs, it looks like yetus is 
managing the cache on them.  I'm seeing individual run containers from days ago 
at least.  So that's a good sign.

> Can a exemption list be passed to the script so that images matching a 
> certain regex are excluded? You say the script ignores labels entirely, so 
> perhaps not...

Patches accepted. ;)

FWIW, I've been testing on my local machine for unrelated reasons and I 
keep blowing away running containers I care about so I might end up adding it 
myself.  That said: the code was specifically built for CI systems where the 
expectation should be that nothing is permanent.



Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub

2020-10-29 Thread Allen Wittenauer



> On Oct 28, 2020, at 11:57 PM, Chris Lambertus  wrote:
> 
> Infra would LOVE a smarter way to clean the cache. We have to use a heavy 
> hammer because there are 300+ projects that want a piece of it, and who don’t 
> clean up.. We are not build engineers, so we rely on the community to advise 
> us in dealing with the challenges we face. I would be very happy to work with 
> you on tooling to improve the cleanup if it improves the experience for all 
> projects.

I'll work on YETUS-1063 so that things make more sense.  But in short, 
Yetus' "docker-cleanup --sentinel" will  purge container images if they are 
older than a week, then kill stuck containers after 24 hours. That order 
prevents running jobs from getting into trouble.  But it also means that in 
some cases it doesn't look very clean until two or three days later.  But 
that's ok: it is important to remember that an empty cache is a useless cache.  
Those values came from experiences with Hadoop and HBase, but we can certainly 
add some way to tune them.  Oh, and unlike the docker tools, it pretty much 
ignores labels.  It does _not_ do anything with volumes, probably something we 
need to add.
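For anyone who wants to try it on their own nodes, wiring that into a node-level cron job is about all it takes. A sketch, with the install path and schedule being assumptions on my part:

```
# crontab for the build user: run Yetus' docker-cleanup in sentinel mode hourly
0 * * * * /opt/apache-yetus/bin/docker-cleanup --sentinel >> /var/log/docker-cleanup.log 2>&1
```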

Re: Docker rate limits likely spell DOOM for any Apache project CI workflow relying on Docker Hub

2020-10-29 Thread Allen Wittenauer



> On Oct 28, 2020, at 9:01 PM, Joan Touzet  wrote:
> 
> Even for those of us lucky enough to have sponsorship for dedicated CI
> workers, it's still a problem. Infra has scripts to wipe all
> not-currently-in-use Docker containers off of each machine every 24
> hours (or did, last I looked).

Argh.  I really hope this isn't happening again, at least on the 
machines where Apache Yetus' test-patch runs regularly.  It can manage the 
local cache just fine (which is why after we implemented the docker cache 
cleanup code, the Hadoop nodes rarely if ever had docker space problems...).  I 
did separate that part of the code out, so if infra wants a _smarter_ way to 
clean the cache on nodes where test-patch and friends aren't getting used, the 
docker-cleanup utility from Yetus is an option.  (Although, to be fair, that 
utility is poorly documented.  Maybe I'll work on that this week if there is 
interest. )

> 
> 2. Infra provides their own Docker registry. Projects that need images
> can host them there. These will be automatically exempt. Infra will have
> to plan for sufficient storage (this will get big *fast*) and bandwidth
> (same). They will also have to firewall it off from non-Apache projects.

Given that Apache Yetus is about to launch a GitHub Action into the marketplace that uses docker, I've been thinking more and more about pursuing the ASF's access to GitHub's registry due to all of the fallout from Docker, Inc. flailing. Needless to say, firewalls aren't an option for what I'm needing.

Re: GitHub Actions Concurrency Limits for Apache projects

2020-10-14 Thread Allen Wittenauer



> On Oct 13, 2020, at 11:04 PM, Jarek Potiuk  wrote:
> This is a logic
> that we have to implement regardless - whether we use yatus or pre-commit
> (please correct me if I am wrong).

I'm not sure about yatus, but for yetus, for the most part, yes, one would need to implement custom rules in the personality to exactly duplicate the overly complicated and over-engineered Airflow setup.  The big difference is that one wouldn't be starting from scratch.  The difference engine is already there. The file filter is already there.  Full build vs. PR handling is already there. etc etc etc

> For all others, this is not a big issue because in total all other
> pre-commits take 2-3 minutes at best. And if we find that we need to
> optimize it further we can simply disable the '--all-files' switch for
> pre-commit and they will only run on the latest commit-changed files
> (pre-commit will only run the tests related to those changed files). But
> since they are pretty fast (except pylint/mypy/flake8) we think running
> them all, for now, is not a problem.

That's what everyone thinks until they start aggregating the time 
across all changes...



Re: GitHub Actions Concurrency Limits for Apache projects

2020-10-13 Thread Allen Wittenauer



> On Oct 13, 2020, at 11:46 AM, Jarek Potiuk  wrote:
> 
> I am rather interested in how those kinds of cases might be handled better
> by Yetus - i.e. how much smarter it can be when selecting which parts of
> the tests should be run - and how you would define such relation. What
> pre-commit is doing is rather straightforward (run tests on files that
> changed), what I did in tests takes into account the "structure" of the
> project and acts accordingly. And those are rather simple to implement. As
> you'd see in my PR it's merely <100 lines in bash to find which files have
> changed and based on some predefined rules select which tests to run. I'd
> be really interested though if Yates can provide some better ways of
> handling it?

I think you are misunderstanding where Yetus sits in the stack.  And I 
also misunderstood where you were running pre-commit; it's clear you aren't 
running it _also_ as part of the CI, just as part of the developer experience.  
(which also means there is an assumption that every PR _has_ run those tools 
...)

Yetus' precommit is probably better thought of as a pre-merge utility.  The functionality pre-dates git and commit hooks, so...

Anyway, let's look at Airflow. For example, Airflow's CI includes static-checks-pylint (https://github.com/apache/airflow/blob/master/.github/workflows/ci.yml).  This runs on _every_ PR.

Let's look at https://github.com/apache/airflow/pull/11518.  This is a markdown 
update. There is no python code.  Yet:

https://github.com/apache/airflow/runs/1250824427?check_suite_focus=true

16 minutes blown on an absolutely useless check.  And that's just pylint.  Look 
at the entire CI Build workflow:

https://github.com/apache/airflow/actions/runs/305454304

~45 minutes for likely zero value.  Spell check might be the only thing that ran that was actually useful.  That's 45 minutes that something else could have been executing.  I haven't looked at the other workflows that also ran, but that is probably more wasted time.

Under Yetus, test-patch would have detected this PR was markdown, ran 
markdownlint, blanks, and probably a few other things, and reported the 
results.  It probably would have taken _maybe_ 2 minutes, most of that spent 
dealing with the docker image.  Hooray! 43 minutes back in the executor pool 
for either more Airflow work or for another project!

Is this an extreme example? Sure.  But if you stretch these types of cuts over a large number of PRs, it makes a huge, huge difference. 

Airflow is 'advanced' enough in its CI that using test-patch to cut 
back on _all_ of these workflows is certainly possible using custom plug-ins.  
But it might be easier to use smart-apply-patch's --changedfilesreport option 
to generate a list of files in the PR and then short-circuit workflows based 
upon that list of file changes.  (which reminds me, we need to update the 
smart-apply-patch docs to cover this haha)
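Something along these lines, as a sketch; the flag name comes from above, but whether it takes a file argument and whether the GH:pr-number notation applies here are assumptions on my part, so double-check against the docs before relying on it:

```
# write the list of files the PR touches, then gate the expensive workflow on it
smart-apply-patch --changedfilesreport=/tmp/changed-files.txt GH:11518

if grep -qE '\.py$' /tmp/changed-files.txt; then
  echo "python changed: run the pylint/mypy/flake8 jobs"
else
  echo "no python changes: skip them entirely"
fi
```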

===

In the specific case of testing, test-patch will slice and dice tests based upon the build tool.  So if you are using, say, maven, it will only run the unit tests for the changed module.  There is nothing an end user needs to do. No classifications or anything like that. They get that functionality for free. Unfortunately, Airflow doesn't use a build tool that Yetus supports, so it wouldn't work out of the box, but it could be shoe-horned in by supplying a custom build tool.  Probably not worth the effort in this specific use case, frankly.





Re: GitHub Actions Concurrency Limits for Apache projects

2020-10-13 Thread Allen Wittenauer



> On Oct 13, 2020, at 9:02 AM, Jarek Potiuk  wrote:
> 
> Yep having pre-commits is cool and we extensively use it as part of our
> setup in Airflow. Since we are heavily Pythonic project we are using the
> fantastic https://pre-commit.com/  framework.

Is pre-commit still "dumb?"  i.e., does it treat PRs and branches the same?  Because Yetus doesn't.  It gives targeted advice based upon the change, which makes it faster during the PR cycle, which is why the bigger the project, the bigger the speed bump.

Re: GitHub Actions Concurrency Limits for Apache projects

2020-10-13 Thread Allen Wittenauer


> On Oct 13, 2020, at 5:00 AM, Jarek Potiuk  wrote:
> Who else is using GitHub Actions extensively?


It's funny you mention that.  A lot of the stress on the ASF Jenkins instance was relieved tremendously by deploying Apache Yetus.  It might be useful for some users on GitHub Actions to take a look at using our current top of tree to see if it would be helpful for them, as well as to help us test it out before we publish to the GitHub Marketplace as part of our next release. (Yetus becoming the first ... and maybe only? ... Apache project to publish there.)

Web pages of interest:

* Base page of Apache Yetus' testing facilities:  
http://yetus.apache.org/documentation/in-progress/precommit/

* The specific page about GitHub Actions: 
http://yetus.apache.org/documentation/in-progress/precommit/robots/githubactions/

Note that the _default_ is pure static linting to make the learning curve a bit easier.  To get actual builds, ASF license checks, and a few other things turned on, you'll need to set the build tool and likely provide the list of plug-ins you want to turn on.  The list of available plug-ins is at the bottom of that first page.
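As a rough illustration of what that means at the test-patch level; the flag spellings here are from memory and the plug-in list is arbitrary, so treat this as an assumption and check the docs above:

```
# turn on a real build plus a few checks instead of the lint-only default
test-patch \
  --build-tool=maven \
  --plugins=asflicense,checkstyle,shellcheck,markdownlint \
  --basedir=/path/to/source \
  GH:123
```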

If you have any questions, let us know!  Thanks.

Re: [PROPOSAL] - Change the Descriptions to installed packages on Jenkins

2020-09-23 Thread Allen Wittenauer



> On Sep 23, 2020, at 10:51 AM, Gavin McDonald  wrote:
> 
> The above list would then become:-
> 
> JDK_1.8_latest
> JDK_16_latest
> JDK_1.7.0_79_unlimited security

> Thoughts please. Unless there is some real *strong* objection with
> technical reasons then I intend to make this change in a week or two.


This change doesn't impact me in any way/shape/form (hooray, container images... never mind that we're not using ASF Jenkins anymore), but if I'm allowed to bikeshed for a moment, I'd recommend noting _where_ the JDKs come from.  OpenJDK vs. Oracle JDK vs. Azul vs. whatever all tend to be slightly different.  It might save some time later since this change will break the world anyway.

Re: Controlling the images used for the builds/releases

2020-09-13 Thread Allen Wittenauer



> On Sep 13, 2020, at 2:55 PM, Joan Touzet  wrote:
>> I think that any release of ASF software must have corresponding sources
>> that can be use to generate those from. Even if there are some binary
>> files, those too should be generated from some kind of sources or
>> "officially released" binaries that come from some sources. I'd love to get
>> some more concrete examples of where it is not possible.
> 
> Sure, this is totally possible. I'm just saying that the amount of source is 
> extreme in the case where you're talking about a desktop app that runs in 
> Java or Electron (Chrome as a desktop app), as two examples.


... and mostly impossible when talking about Windows containers.



Re: Controlling the images used for the builds/releases

2020-06-22 Thread Allen Wittenauer


> On Jun 22, 2020, at 6:52 AM, Jarek Potiuk  wrote:
> 1) Is this acceptable to have a non-officially released image as a
> dependency in released code for the ASF project?

My understanding is that the bigger problem is the license of the dependency (and its dependencies) rather than the official/unofficial status.  For Apache Yetus' test-patch functionality, we defaulted all of our plugins to off because we couldn't depend upon GPL'd binaries being available or risk giving the impression that they were required.  Doing so put the onus on the user to specifically enable features that depend upon GPL'd functionality.  It also pretty much nukes any idea of being user friendly. :(

> 2) If it's not - how do we determine which images are "officially
> maintained".

Keep in mind that Docker themselves brand their images as 'official' 
when they actually come from Docker instead of the organizations that own that 
particular piece of software.  It just adds to the complexity.

> 3) If yes - how do we put the boundary - when image is acceptable? Are
> there any criteria we can use or/ constraints we can put on the
> licences/organizations releasing the images we want to make dependencies
> for released code of ours?

License means everything.

> 4) If some images are not acceptable, shoud we bring them in and release
> them in a community-managed registry?

For the Apache Yetus docker image, we're including everything that the 
project supports.  *shrugs*



Re: broken builds taking up resources

2020-01-28 Thread Allen Wittenauer



> On Jan 28, 2020, at 8:02 PM, Chris Lambertus  wrote:
> 
> 
> Allen, can you elaborate on what a “proper” implementation is?  As far as I 
> know, this is baked into jenkins. We could raise process limits for the 
> jenkins user, but these situations only tend to arise when a build has gone 
> off the rails.
> 

You are correct: the limitations come from the implementation of the Jenkins slave jar. Ideally it would run the slave.jar as one user and the executors as one or more other users.  Or at least use cgroups on Linux, RBAC on Solaris, jails on FreeBSD, and so on, to do a minimal amount of work to protect itself. Instead, it depends upon the good will of spawned processes to not shoot it or anything else running on the box.  This works great for the absolutely simple case, but completely falls apart for anything beyond running a handful of shell commands.

Thus why I consider it idiotic.  There are ways Jenkins could have done some work to prevent this situation from occurring, but alas that is not the case.  Yes, it would require more setup of the client, but for those places that need it (i.e., most), it would have been worth it. 

Instead, on-prem operators are pretty much forced to build a ton of 
complex machinery to prevent users from wreaking havoc. [1] Or give up and move 
to either Jenkins talking to cloud or dump Jenkins entirely.


[1] - The best on-prem solution I came up with  (before I moved my $DAYJOB 
stuff to cloud) was to run each executor in a VM on the box.  That VM would 
also have a regularly scheduled job that would cause it to wipe itself and 
respawn via a trigger mechanism.  Yeah, completely sucks, but at least it 
affords a lot more safety. 

Re: broken builds taking up resources

2020-01-27 Thread Allen Wittenauer



> On Jan 27, 2020, at 10:52 PM, Allen Wittenauer 
>  wrote:
> 
>   This is almost always because whatever is running on the two executors 
> have suffocated the system resources.

... and before I forget, a reminder: Java threads take up a file descriptor. Hadoop's unit tests were firing up 10s of thousands of threads, which were eating up 10s of thousands of FDs and ultimately led to "cannot fork, no resource" errors, causing everything to come tumbling down for the Jenkins slave process.  So _all_ the resources, not just RAM or whatever.
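If you want to see whether a node is trending that way, here's a quick Linux-only sketch; the 'jenkins' user name is an assumption, so substitute whatever user the agent runs as:

```
# open file descriptors per process owned by the agent user
for pid in $(pgrep -u jenkins); do
  printf '%s %s open fds\n' "$pid" "$(ls /proc/"$pid"/fd 2>/dev/null | wc -l)"
done

# thread counts per process for the same user
# (per the note above, each Java thread also costs an fd)
ps -u jenkins -o pid,nlwp,comm
```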

Re: broken builds taking up resources

2020-01-27 Thread Allen Wittenauer



> On Jan 27, 2020, at 6:37 PM, Andriy Redko  wrote:
> 
> Thanks a lot for looking into it. From the CXF perspective, I have seen that 
> many CXF builds have been aborted
> because of the connection with master is lost (don't have exact builds to 
> point since we keep only last 3),
> that could probably explain the hanging builds. 


This is almost always because whatever is running on the two executors has suffocated the system resources. This ends up starving the Jenkins slave.jar, thus causing the disconnect.  (It's extremely important to understand that Jenkins' implementation here is sort of brain dead: the slave.jar runs as the SAME USER as the jobs being executed.  This is an idiotic implementation, but it is what it is.)

Anyway, in my experience, if all or most of one type of job is failing in a way that makes the node appear to have crashed, then there is a good chance that job is the cause.  So it would be great if someone could spend the effort to profile the CXF jobs to see what their actual resource consumption is.

FWIW, we had this problem with Hadoop, HBase, and others on the 
'Hadoop' label nodes. The answer was to:

a) always run our jobs in containers that could be easily killed (freestyle 
Jenkins jobs that do 'docker run' generally can't be killed, despite what the 
UI says, because the signal never reaches the container)
b) those containers had resource limits 
c) increase the resources that systemd is allowed to give the jenkins user

After doing that, the number of failures on the Hadoop nodes dropped 
exponentially. 
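For (a) and (b), the shape of it was roughly the following; the limits, image name, and command are placeholders rather than the values we actually used:

```
# run the build inside a container with hard caps so a runaway job cannot
# starve the Jenkins agent process on the same host
docker run --rm \
  --memory=8g --memory-swap=8g \
  --cpus=4 \
  --pids-limit=4096 \
  --name "build-${BUILD_TAG:-manual}" \
  build-env:latest \
  bash -c 'mvn -B clean verify'
```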

Re: Fair use policy for build agents?

2019-08-25 Thread Allen Wittenauer



> On Aug 25, 2019, at 9:13 AM, Dave Fisher  wrote:
> Why was Hadoop invented in the first place? To take long running tests of new 
> spam filtering algorithms and distribute to multiple computers taking tests 
> from days to hours to minutes.

Well, it was significantly more than that, but ok.

> I really think there needs to be a balance between simple integration tests 
> and full integration.

You’re in luck!  That’s exactly what happens! Amongst other things, 
I’ll be talking about how projects like Apache Hadoop, Apache HBase, and more 
use Apache Yetus to do context sensitive testing at ACNA in a few weeks.

Re: Fair use policy for build agents?

2019-08-24 Thread Allen Wittenauer



> On Aug 23, 2019, at 2:13 PM, Christofer Dutz  
> wrote:
> 
> well I agree that we could possibly split up the job into multiple separate 
> builds. 

I’d highly highly highly recommend it.  Right now, the job effectively 
has a race condition: a job-level timer based upon the assumption that ALL 
nodes in the workflow will be available within that timeframe. That’s not 
feasible long term.

> However this makes running the Jenkins Multibranch pipeline plugin quite a 
> bit more difficult.

Looking at the plc4x Jenkinsfile, prior to INFRA creating the 
’nexus-deploy’ label and pulling H50 from the Ubuntu label, it wouldn’t have 
been THAT difficult.

e.g., this stage:

```
stage('Deploy') {
when {
branch 'develop'
}
// Only the official build nodes have the credentials to deploy 
setup.
agent {
node {
label 'ubuntu'
}
}
steps {
echo 'Deploying'
// Clean up the snapshots directory.
dir("local-snapshots-dir/") {
deleteDir()
}

// Unstash the previously stashed build results.
unstash name: 'plc4x-build-snapshots'

// Deploy the artifacts using the wagon-maven-plugin.
sh 'mvn -f jenkins.pom -X -P deploy-snapshots wagon:upload'

// Clean up the snapshots directory (freeing up more space 
after deploying).
dir("local-snapshots-dir/") {
deleteDir()
}
}
}
```

This seems pretty trivially replaced with build 
(https://jenkins.io/doc/pipeline/steps/pipeline-build-step/#build-build-a-job) 
and copyartifacts.  Just pass the build # as a param between jobs.  

Since the site section also has the same sort of code and problems, a 
Jenkins pipeline library may offer code consolidation facilities to make it 
even easier.

> And the thing is, that our setup has been working fine for about 2 years and 
> we are just recently having these problems. 

Welp, things change.  Lots of project builds break on a regular basis 
because of policy decisions, the increase in load, infra software changes, etc. 
 Consider it very lucky it’s been 2 years.  The big projects get broken on a 
pretty regular basis. (e.g., things like https://s.apache.org/os78x just fall 
from the sky with no warning.  This removal broke GitHub multi branch pipelines 
as well and many projects I know of haven’t switched. It’s just easier to run 
Scan every-so-often thus making the load that much worse ...)

I should probably mention that many many projects already have their 
website and deploy steps separated from their testing job.  It’s significantly 
more efficient on a global/community basis. In my experiences with Jenkins and 
other FIFO job deployment systems (as well as going back to your original 
question):

 fairness is better achieved when the jobs are faster/smaller because 
it gives the scheduler more opportunities to spread the load.

> So I didn't want to just configure the actual problem away, because I think 
> splitting it up into multiple separate 
> jobs will just bring other problems and in the end our deploy jobs will then 
> just still hang for many, many hours. 

Instead, this is going to last for another x years and then H50 is going to get busy again as everyone moves their deploy step to that node.  Worse, it’s going to clog up the Ubuntu label even more because those jobs are going to tie up the OTHER node that their job is associated with while the H50 job runs.  plc4x at least has the advantage that it’s only breaking itself when it’s occupying the H50 node.  

As mentioned earlier, the ‘websites’ stage has the same issue and will 
likely be the first to break since there are other projects that are already 
using that label.

Re: Fair use policy for build agents?

2019-08-23 Thread Allen Wittenauer


> On Aug 23, 2019, at 9:44 AM, Gavin McDonald  wrote:
> The issue is, and I have seen this multiple times over the last few weeks,
> is that Hadoop pre-commit builds, HBase pre-commit, HBase nightly, HBase
> flaky tests and similar are running on multiple nodes at the same time.

The precommit jobs are exercising potential patches/PRs… of course 
there are going to be multiples running on different nodes simultaneously.  
That’s how CI systems work.

> It
> seems that one PR or 1 commit is triggering a job or jobs that split into
> part jobs that run on multiple nodes.

Unless there is a misconfiguration (and I haven’t been directly 
involved with Hadoop in a year+), that’s incorrect.  There is just that much 
traffic on these big projects.  To put this in perspective, the last time I did 
some analysis in March of this year, it works out to be ~10 new JIRAs with 
patches attached for Hadoop _a day_.  (Assuming an equal distribution across 
the year/month/week/day. Which of course isn’t true.  Weekdays are higher, 
weekends lower.)  If there are multiple iterations on those 10, well….  and 
then there are the PRs...

> Just yesterday I saw Hadoop and HBase
> taking up nearly 45 of 50 H* nodes. Some of these jobs take many hours.
> Some of these jobs that take many hours are triggered on a PR or a commit
> that could be something as trivial as a typo. This is unacceptable.

The size of the Hadoop jobs is one of the reasons why Yahoo!/Oath gave 
the ASF machine resources. (I guess that may have happened before you were part 
of INFRA.)  Also, the job sizes for projects using Yetus are SIGNIFICANTLY 
reduced: the full test suite is about 20 hours.  Big projects are just that, 
big.

> HBase
> in particular is a Hadoop related project and should be limiting its jobs
> to Hadoop labelled nodes H0-H21, but they are running on any and all nodes.

Then you should take that up with the HBase project.

> It is all too familiar to see one job running on a dozen or more executors,
> the build queue is now constantly in the hundreds, despite the fact we have
> nearly 100 nodes. This must stop.

’nearly 100 nodes’: but how many of those are dedicated to specific 
projects?  1/3 of them are just for Cassandra and Beam. 

Also, take a look at the input on the jobs rather than just looking at 
the job names.

It’s probably also worth pointing out that since INFRA mucked with the GitHub pull request builder settings, they’ve caused a stampeding-herd problem.  As soon as someone runs a scan on the project, ALL of the PRs get triggered at once, regardless of whether there has been an update to the PR or not.  

> Meanwhile, Chris informs me his single job to deploy to Nexus has been
> waiting in 3 days.

It sure sounds like Chris’ job is doing something weird though, given 
it appears it is switching nodes and such mid-job based upon their description. 
 That’s just begging to starve.

===

Also, looking at the queue this morning (~11AM EDT), a few observations:

* The ‘ubuntu’ queue is pretty busy while ‘hadoop’ has quite a few open slots.

* There are lots of jobs in the queue that don’t support multiple runs.  So 
they are self-starving and the problem lies with the project, not the 
infrastructure.

* A quick pass shows that some of the jobs in the queue are tied to specific nodes or have such a limited set of nodes as possible hosts that _of course_ they are going to get starved out.  Again, a project-level problem.

* Just looking at the queue size is clearly not going to provide any real data as to what the problems are without also looking into why those jobs are in the queue to begin with.

Re: Fair use policy for build agents?

2019-08-23 Thread Allen Wittenauer


Something is not adding up here… or I’m not understanding the issue...


> On Aug 22, 2019, at 6:41 AM, Christofer Dutz  
> wrote:
> we now had one problem several times, that our build is cancelled because it 
> is impossible to get an “ubuntu” node for deploying artifacts.
> Right now I can see the Jenkins build log being flooded with Hadoop PR jobs.


The master build queue will show EVERY job regardless of label and will 
schedule the first job available for that label in the queue (see below).  In 
fact, the hadoop jobs actually have a dedicated label that most of the other 
big jobs (are supposed to) run on:

https://builds.apache.org/label/Hadoop/

Compare this to:

https://builds.apache.org/label/ubuntu/

The nodes between these two are supposed to be distinct.  Of course, 
there are some odd-ball labels out there that have a weird cross-section:

https://builds.apache.org/label/xenial/

Anyway ...

> On Aug 23, 2019, at 5:22 AM, Christofer Dutz  
> wrote:
> 
> the problem is that we’re running our jobs on a dedicated node too …

Is the job running on a dedicated node or the shared ubuntu label?  

> So our build runs smoothly: Doing Tests, Integration Tests, Sonar Analysis, 
> Website generation and then waits to get access to a node that can deploy and 
> here the job just times-out :-/

The job has multiple steps that run on multiple nodes? If so, you’re going to have a bad time if you’ve put a timeout on the entire job.  That’s just not realistic.  If it actually needs to run on multiple nodes, why not just trigger a new job via a pipeline API call (buildJob) that can sit in the queue and take the artifacts from the previously successful run as input?  Then it won’t time out.

> Am August 22, 2019 10:41:36 AM UTC schrieb Christofer Dutz 
> :
> Would it be possible to enforce some sort of fair-use policy that one project 
> doesn’t block all the others?


Side note: Jenkins default queuing system is fairly primitive:  pretty much a 
node-based FIFO queue w/a smattering of node affinity. Node has a free slot, 
check first job in the queue. Does it have a label or node property that 
matches?  Run it. If not, go to the next job in the queue.  It doesn’t really 
have any sort of real capacity tracking to prevent starvation. 




Re: External CI Service Limitations

2019-07-03 Thread Allen Wittenauer



> On Jul 3, 2019, at 3:15 PM, Joan Touzet  wrote:
> 
> I was asking if any of the service platforms provided this. So far, it looks 
> like no.

I was playing around a bit with Drone today because we actually need ARM 
in $DAYJOB and this convo reminded me that I needed to check it out.

So far, I’m a little underwhelmed with the feature set. (No built-in 
artifacting, no junit output processing, buggy/broken yaml parser,  … to be 
fair, they are relatively new so likely still building these things up) BUT! 
They do support gitlab and acting as a gitlab ci runner. So theoretically one 
could do linux/x86, windows/x86, mac os x, and linux/arm off of a combo of 
gitlab ci + drone.




Re: External CI Service Limitations

2019-07-03 Thread Allen Wittenauer



> On Jul 3, 2019, at 8:04 AM, Joan Touzet  wrote:
> 
> (With my CouchDB release engineer hat on only)
> 
> Anyone know if any of these external services supports platforms other
> than amd64/x86_64?

AFAIK, all of the commercial SaaS companies that I have experience with and 
that offer an “open source can use for free” tier only do x86.[*]   Most of 
them that have external runners do support ‘bring your own non-x86’ machine 
types.  

There’s also things like OpenLab and the Linux Foundation machines too. 
 (The latter provided PowerPC machine access to ASF Jenkins but I know when I 
tried to use them years ago they were incredibly unstable and the software 
install was extremely lacking to the point of being ultimately unusable for the 
project.)

I’ve never worked with pure pay companies in this area, so there are 
likely companies out there.  It might be worthwhile for someone to reach out 
to one, since we might get free access for Exposure Bucks or something.


> CouchDB keeps receiving a lot of pressure to build on aarch64, ppc64le
> and s390x, which keeps pushing us back to Jenkins CI (ASF or
> independent). And if we have to do that, then not much else matters to us.

One of the nice things about using a system that supports external 
runners is that it allows for contributions of CPU time from like minded 
individuals.  I wouldn’t trust them to do anything more than run tests though.

* - The big problem here is the lack of cost effective non-x86 machines from 
cloud providers. Drone, for example, does do ARM since it’s an extension of 
Packet’s cloud. But I don’t have experience with it so didn’t mention it 
earlier….  Maybe I’ll bang on it over the long weekend.  

Re: External CI Service Limitations

2019-07-03 Thread Allen Wittenauer



> On Jul 2, 2019, at 11:12 PM, Jeff MAURY  wrote:
> 
> Azure pipeline vas the big plus of supporting Linux Windows and macos nodes

There’s a few that support various combinations of non-Linux.  Gitlab 
CI has been there for a while.  Circle CI has had OS X and is in beta with 
Windows.  Cirrus CI has all those plus FreeBSD. etc, etc.  It’s quickly 
becoming required that cloud-based CI systems do more than just throw up a 
Linux box. 

> And i think you can add you nodes to the pools

I think they are limited to being on Azure tho, IIRC.  But I’m probably 
not remembering that correctly.  I pretty much gave up on doing anything serious with it.

I really wanted to like pipelines.  The UI is nice.  But in the end, 
Pipelines was one of the more frustrating ones to work with in my 
experience, and that was with some help from the MS folks. It suffers a death 
by a thousand cuts (lack of complex, real-world examples, custom docker binary, 
pre-populated bits here and there, a ton of env vars, an artifact system that is a 
total disaster, etc, etc).  Lots of small problems that add up to just not 
being worth the effort.  

Hopefully it’s improved since I last looked at it months and months ago 
though.

Re: External CI Service Limitations

2019-07-03 Thread Allen Wittenauer


> On Jul 2, 2019, at 10:21 PM, Greg Stein  wrote:
> 
> We'll keep this list apprised of anything we find. If anybody knows of,
> and/or can recommend a similar type of outsourced build service ... we
> *absolutely* would welcome pointers.

FWIW, we’ve been collecting them bit by bit into Apache Yetus ( 
http://yetus.apache.org/documentation/in-progress/precommit-robots/ ):

* Azure Pipelines
* Circle CI
* Cirrus CI
* Gitlab CI
* Semaphore CI
* Travis CI

They all have some pros and cons.  I’m not going to rank them or 
anything.

I will say, however, it really feels like Gitlab CI is the best bet to 
pursue since one can add their own runners to the Gitlab CI infrastructure 
dedicated to their own projects.  That ultimately means that replacing Jenkins 
slaves is a very real possibility.

(Also, I’ve requested access to the Github Actions beta, but haven’t 
received anything yet.  I have a hunch that the reworking of the OAuth 
permission model is related, which may make some of these more viable for the 
ASF.)

Re: GitHub PR -> Multi-branch Jenkins pipeline triggering

2019-05-10 Thread Allen Wittenauer



> On May 10, 2019, at 2:04 PM, Zoltán Nagy  wrote:
> 
> In Zipkin (Incubating) we have a multi-branch pipeline building all our
> projects:
> https://builds.apache.org/view/Z/view/Zipkin/job/GH-incubator-zipkin/
> 
> Unfortunately pull-requests don't seem to trigger runs (or perhaps rather,
> repository scans) on any of the projects. Given that we don't have admin
> access to the GH repositories, root-causing this is turning out to be
> tricky. I was hoping you might be able help us figure out what we're
> missing.

Probably the web hook configuration on the GitHub side, which, IIRC, 
has to be done as an INFRA request. e.g., 
https://issues.apache.org/jira/browse/INFRA-17471



Re: Jenkins Build for Heron

2019-04-23 Thread Allen Wittenauer



> On Apr 23, 2019, at 11:50 AM, Josh Fischer  wrote:
> 1. Does the Jenkins box have the build tools listed below already?  Or do
> you think it would be better if I downloaded and installed in the workspace
> for each build?

I’d *highly* recommend using a docker container so that you can control 
the software versions.  Especially with node involved.
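
For example, something along these lines (a sketch; the Dockerfile location and build script are placeholders) pins the toolchain to whatever the project checks in rather than whatever happens to be installed on the node:

pipeline {
    agent {
        // Build the image from a Dockerfile kept in the repo so the
        // JDK/node/etc. versions are controlled by the project, not the node.
        dockerfile {
            dir 'docker/build-env'      // hypothetical path to the Dockerfile
            label 'ubuntu'
        }
    }
    stages {
        stage('build') {
            steps {
                sh './scripts/build.sh'   // placeholder for the real entry point
            }
        }
    }
}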



Re: PRJenkins builds for Projects

2019-01-12 Thread Allen Wittenauer



> On Jan 12, 2019, at 7:58 AM, Allen Wittenauer 
>  wrote:
> 
> 
>   For Apache Yetus, we do a few things to circumvent this problem, 
> including making sure ${HOME} is defined and doing run specific docker images 
> ( 
> https://github.com/apache/yetus/blob/master/precommit/src/main/shell/test-patch-docker/Dockerfile.patchspecific
>  ) based upon a provided docker file or docker tag. 


Ha. Just noticed there is a bug  in that file.




Re: PRJenkins builds for Projects

2019-01-12 Thread Allen Wittenauer



> On Jan 11, 2019, at 11:23 PM, Dominik Psenner  wrote:
> 
> I can enlist another pain point I faced while implementing the pipeline for
> log4net. I had to find a way of detecting the uid/guid of the jenkins user
> to make it work with dotnet core commandline inside docker. That really got
> my head aching and as far as I can remember stemmed from the fact that the
> dotnet commandline was unable to the detect the home directory and
> attempted to write files into places of the filesystem that it was not
> supposed to. 

I’m guessing you are using maven via the docker agent?  In my 
experience, this is a combination of docker implementation details, missing 
features in Jenkins, and misleading maven documentation (hint: system property 
user.home does not always equal $HOME!). There is a not-very-well-documented 
workaround, probably because it demonstrates a security weakness in the Jenkins 
agent+Docker setup.  Just mount /etc/passwd and friends in your container:

pipeline {
    agent {
        docker {
            image 'foo'
            label 'bar'
            args '-v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro -v ${HOME}:${HOME}'
        }
    }

    …

}

Note that for Docker-in-Docker setups, this sometimes completely falls 
apart (e.g., jnlp and OS X agents) due to the docker daemon running in a 
different namespace from where the jenkins user is.  (OS X’s docker 
implementation just flat out lies about the universe!) There’s also the issue 
of conflicting user/group definitions, but it is what it is.  Luckily, this 
usually works well enough to work around executables that don’t honor the $HOME 
variable and instead look at passwd information. 

For Apache Yetus, we do a few things to circumvent this problem, 
including making sure ${HOME} is defined and doing run specific docker images ( 
https://github.com/apache/yetus/blob/master/precommit/src/main/shell/test-patch-docker/Dockerfile.patchspecific
 ) based upon a provided docker file or docker tag. 

I wish Jenkins did something similar.  If it provided a hook for a 
run-specific Dockerfile that does the necessary magic before launching, so that the 
Jenkins user could be defined, we’d all be better off.  (cgroup+user remapping 
might also work, but I doubt it.)

Re: PRJenkins builds for Projects

2019-01-11 Thread Allen Wittenauer



> On Jan 10, 2019, at 11:28 PM, Stephen Connolly 
>  wrote:
> 
> On Fri 11 Jan 2019 at 06:28, Joan Touzet  wrote:
>> 
>> I'm willing to believe that Jenkins, the software, is incapable of
> 
> 
> I assume you meant capable rather than incapable.

Nope, I agree with Joan: incapable is probably the correct word.

I’ve lost track of how many issues I’ve hit just in the past week of 
missing or broken features. [1]  It’s very clear (esp with Blue Ocean and 
Pipelines) that Cloudbees is trying very hard to push people into a very 
simplified model of CI.  Anything complex either can’t be done or requires so 
much work that it isn’t practical. [2]  

>> What about buildbot? Or another technology we could use with INFRA's
>> support? Last time I looked at buildbot, its integration with Docker
>> was very poor.
>> 
>> I don't have any special attachment to Jenkins.

IMO, this is probably something we as a community should look into 
doing.  We’re pushing Jenkins way harder than what it feels like it was 
designed to do, given how many issues we hit on a regular basis and some of the 
core limitations of the platform.


1 - My current favorite being JENKINS-17116, which I’ve hit in both freestyle 
and pipeline jobs to the point that I ended up writing pid handling because I 
can’t trust Jenkins to actually signal processes properly.
 
2 - JENKINS-27413… and Glick’s answer just re-affirms in my mind that Jenkins 
is getting dumbed down: why push this off to a plugin?



Re: Can we package release artifacts on builds.a.o?

2019-01-07 Thread Allen Wittenauer



> On Jan 7, 2019, at 11:50 AM, Alex Harui  wrote:
> 
> I don't understand.  Who am I "making" do what work?  And why do at least 3 
> others want something similar?  And what would you propose Royale should do 
> instead?  Always have me cut releases?

If their computers are broken, they could always spin up a free micro 
instance on AWS. Cut the release.  Then spin it down.

But it really makes me wonder how they are developing and testing their 
changes locally if building is such a burden ...



Re: PRJenkins builds for Projects

2019-01-06 Thread Allen Wittenauer



> On Jan 6, 2019, at 10:43 AM, Dominik Psenner  wrote:
> 
> On Sun, Jan 6, 2019, 19:32 Allen Wittenauer
>  
>> 
>> a) The ASF has been running untrusted code since before Github existed.
>> From my casual watching of Jenkins, most of the change code we run doesn’t
>> come from Github PRs.  Any solution absolutely needs to consider what
>> happens in a JIRA-based patch file world. [footnote 1,2]
>> 
> 
> If some project build begins to draw resources in an extraordinary fashion
> it will be noticed.

Strongly disagree. Just yesterday my cleaner code killed three stuck surefire jobs 
that had been looping on a handful of cores since sometime in 2018.  
The sling jobs I noted earlier in the week were holding 20GB of RAM.   That’s even 
before we get into the unit-tests-that-are-really-integration-tests coming 
from the big data projects, where gigs of memory and thousands of process 
slots are consumed on a regular basis. 

Re: PRJenkins builds for Projects

2019-01-06 Thread Allen Wittenauer


a) The ASF has been running untrusted code since before Github existed.  From 
my casual watching of Jenkins, most of the change code we run doesn’t come from 
Github PRs.  Any solution absolutely needs to consider what happens in a 
JIRA-based patch file world. [footnote 1,2]

b) Making everything get reviewed by a committer before executing is a 
non-starter.  For large communities, precommit testing acts as a way for 
contributors to get feedback prior to a committer even getting involved.  This 
allows for change iteration prior to another human spending time on it.  But 
the secondary effect is that it acts as a funnel: if a project gets thousands 
of change requests a year [footnote 3], it’s now trivial for committers to 
focus their energy on the ones that are closest to commit.

c) We’ve needed disposable environments (what Stephen Connolly called throwaway 
hardware and is similar to what Dominik Psenner talked about wrt gitlab 
runners) for a while.  When INFRA enabled multiple executors per node (which 
they did for good reasons), it triggered an avalanche of problems:  maven’s 
lack of repo locking, noisy neighbors, Jenkins’ problems galore (security and 
DoS which still exist today!), systemd’s cgroup limitations, and a whole lot 
more.  Getting security out of them is really just extra at this point.



1 - With the forced moved to gitbox, this may change, but time will tell.

2 -  FWIW: Gavin and I have been playing with Jenkins’ JIRA Trigger Plugin and 
finding that it’s got some significant weaknesses and needs a lot of support 
code to make viable. This means we’ll likely be sticking with some form of 
Yetus’ precommit-admin for a while longer. :(  So the bright side here is that 
at least the ASF owns the code to make it happen.

3 - Some perspective: Hadoop generated ~6500 JIRAs with patch files attached 
last year alone for the roughly 15 active committers to review.  If, on average, 
each issue had the initial patch plus a single iteration, that’s 13,000 
patches that got tested on Jenkins. 

Re: PRJenkins builds for Projects

2019-01-05 Thread Allen Wittenauer



> On Jan 4, 2019, at 1:06 PM, Joan Touzet  wrote:
> 
> 
> - Original Message -
>> From: "Allen Wittenauer" 
> 
>>  This is the same model the ASF has used for JIRA for a decade+.
>>   It’s always been possible for anyone to submit anything to Jenkins
>>  and have it get executed. Limiting PRs or patch files in JIRAs to
>>  just committers is very anti-community. (This is why all this talk
>>  about using Jenkins for building artifacts I find very
>>  entertaining.  The infrastructure just flat out isn’t built for it
>>  and absolutely requires disposable environments.)
> 
> Then we build a new, additional Jenkins that is committer-only (or PMC-
> only, perhaps, if it's for release purposes). This is a tractable
> problem.

I think people forget that the ASF is a non-profit for individuals.  
It’s not a business.  It’s not a non-profit that requires its members to be 
companies willing to pay astronomical fees.  People-time is almost all 
volunteer.  As such, time to work on these problems is in *extremely* short 
supply, never mind the actual hardware, power, etc, costs.  That’s not even 
covering the legal issues...

> We are stuck at an impasse where people need something to reduce the
> manual workload, and we have an obsolete policy standing in its way.

I’m honestly confused as to why suddenly running scripts on one server 
vs. running them on another one suddenly makes the release process less manual.

> We must be the last organisation in the world where people are forced
> to release software through a manual process.

lol, no, hardly. How many other non-profits have this much software 
with so few paid employees running the show?  

> I don't see why this is something to be gleeful about.

Being entertained is not the same thing as being gleeful.




Re: PRJenkins builds for Projects

2019-01-04 Thread Allen Wittenauer



> On Jan 4, 2019, at 2:00 AM, Christofer Dutz  wrote:
> 
> Hmmm,
> 
> thinking about it ... this is not quite "safe" is it? Just imagining someone 
> starting PRs with maven download-plugin and exec-plugin starting a bitcoin 
> miner or worse ... what does Infra think about this?
> Would prefer the "everyone" PR builds to run on Travis or something that 
> wouldn't harm the ASF.

This is the same model the ASF has used for JIRA for a decade+.  It’s 
always been possible for anyone to submit anything to Jenkins and have it get 
executed. Limiting PRs or patch files in JIRAs to just committers is very 
anti-community. (This is why all this talk about using Jenkins for building 
artifacts I find very entertaining.  The infrastructure just flat out isn’t 
built for it and absolutely requires disposable environments.)



Re: PRJenkins builds for Projects

2019-01-03 Thread Allen Wittenauer



> On Jan 3, 2019, at 7:34 AM, Christofer Dutz  wrote:
> 
> Hi Allen,
> 
> thanks for that ... if I had known that simply selecting the "GitHub" as 
> source instead of the generic "Git" ... would have made things easier ... 
> however it seems that we have exceeded some sort of API usage limit:

Yup.  That’s why we set up our own project-specific user to query GitHub and 
set trust to ‘Everyone’ since our user doesn’t have privs on Github. :/

(See also: last month’s discussion of github’s idiotic permission system.)




Re: Please pick up after yourself

2019-01-03 Thread Allen Wittenauer



> On Jan 3, 2019, at 7:15 AM, Christofer Dutz  wrote:
> 
> Is there a way to check the status of a project? 

There isn’t any sort of global “bad list”. haha.  At least, I know I 
personally gave up trying to keep track of the strays …

> I would like to help improve and have done some things, but I need a way to 
> see that what I'm doing is helping.

Just doing a simple ps -ef | grep "${WORKSPACE}" as a post {} step would 
show if anything is hanging about for maven projects (since they tend to 
full-path everything, and unless your pipeline is doing bizarro things, $WORKSPACE 
should be specific to your job’s executor).
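
A rough sketch of what that looks like at the bottom of a declarative Jenkinsfile (tune the grep to taste):

post {
    // 'always' runs on success and failure alike, so leftovers from
    // hung or aborted test runs show up in the console log too.
    always {
        sh '''
          echo "processes still referencing ${WORKSPACE}:"
          ps -ef | grep "${WORKSPACE}" | grep -v grep || true
        '''
    }
}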

Re: PRJenkins builds for Projects

2019-01-03 Thread Allen Wittenauer



> On Jan 3, 2019, at 7:14 AM, Christofer Dutz  wrote:
> 
> I can't see that ... where can we find that ... and we don't want to 
> automatically push everything that works.
> 
> From the description of "Enable Git validated merge support" it would 
> automatically push everything that passes the build ... that doesn't sound 
> desirable.
> When I look at the "GitHub-Projekt" plugins description ... this doesn't seem 
> to handle PRs.
> 
> Chris

Take a look at 
https://builds.apache.org/view/S-Z/view/Yetus/job/yetus-github-multibranch/configure
 . It does not do a push, handles PRs, branches, and forks.




Re: Please pick up after yourself

2019-01-03 Thread Allen Wittenauer



> On Jan 3, 2019, at 3:11 AM, Bertrand Delacretaz  
> wrote:
> 
> Hi,
> 
> On Fri, Dec 21, 2018 at 10:53 PM Allen Wittenauer
>  wrote:
> 
>> ...Culprits: Accumulo, Reef, and Sling.
> 
> Sling has a few hundred modules, if you have more specific info on
> which are problematic please let us know so we have a better chance of
> fixing that.

I gave up and wrote a (relatively simple) preamble for our jobs to 
shoot any long-running processes that are still hanging out in the workspace 
directories. Output gets logged in the console log.

e.g.:

== 

USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND

jenkins  24952  0.0  0.0 3476248   96 ?Sl2018  23:32 
/home/jenkins/tools/java/latest1.7/bin/java -Xmx512m -Xms256m 
-Djava.awt.headless=true -XX:MaxPermSize=256m -Xss256k -jar 
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefirebooter8344429480529768484.jar
 
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire6624482576438364006tmp
 
/home/jenkins/jenkins-slave/workspace/jclouds-2.1.x/jclouds-labs-2.1.x/jdbc/target/surefire/surefire_44678967186117353271tmp

Killing 24952 ***

USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
jenkins  53339  0.0  0.4 30068248 462472 ? Sl2018   3:23 
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -jar 
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefirebooter4295922957398927030.jar
 
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire8873399700577323873tmp
 
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/surefire/surefire_09146430567560271463tmp

Killing 53339 ***

USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
jenkins  53381  1.5  2.4 13640196 2447672 ?Sl2018  72:48 
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar 
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar
 -p 42022 -Dsling.run.modes=author,notshared

Killing 53381 ***

USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
jenkins  55854  105  2.4 13638076 2422584 ?Sl2018 4967:06 
/usr/local/asfpackages/java/jdk1.8.0_191/jre/bin/java -Xmx2048m -jar 
/home/jenkins/jenkins-slave/workspace/sling-org-apache-sling-distribution-it-1.8/target/dependency/org.apache.sling.launchpad-8.jar
 -p 38732 -Dsling.run.modes=publish,notshared

Killing 55854 ***

===

BTW, I hope people realize that surefire doesn’t actually report all 
unit test failures.  It makes the assumption that a unit test will write an XML 
file.  If the unit test gets stuck or dies for any number of other reasons, it won’t get 
reported as a failure.  It’s why maven jobs absolutely need to do a post-action 
to check for these things (and then kill them so they don’t hang around eating 
resources).  Hint: running in a docker container makes the post-action required 
for this much more fool-proof.

I’m also growing more and more suspicious of some of the tuning on the 
build nodes.  I have a hunch that other systemd bits beyond pid limits need to 
get changed since it doesn’t appear that all node resources are actually 
available to the ‘jenkins’ user.  But I can’t pin down exactly which ones they 
are.  I do know that since adding the ‘kill processes > 24 hours’ code, my own 
jobs have failed due to “can’t exec” errors only once.

Please pick up after yourself

2018-12-21 Thread Allen Wittenauer


I’m now at 4 times this week where my build job has landed on a node that has 
broken JVM tasks hanging about from surefire tests gone awry.  (Culprits: 
Accumulo, Reef, and Sling.) Due to the way Linux does process limits on 
systemd-based boxes, even though there is plenty of CPU and memory, my tasks 
are getting killed because all of these surefire tests have spawned enough 
threads that everything else fails.

Folks: if you aren’t running in a docker container (which makes it 
extremely easy to clean up as well as enforce a sub-5k process limit), please add 
a Post Action on your Jenkins job to blow away any of your tasks that are still 
hanging around. 
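
For the container route, the relevant bit is docker's --pids-limit flag; roughly this shape (the image name is a placeholder):

agent {
    docker {
        image 'project-build-env:latest'   // placeholder image
        label 'Hadoop'
        // cap how many processes/threads the whole build can spawn;
        // when the container exits, everything inside it goes with it.
        args  '--pids-limit=5000'
    }
}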

At this point, I feel like I have no choice but to just start nuking any 
long-running java processes (minus the agent/slave.jar and the datadog stuff that infra runs) 
before startup just so I can get a build. :(




Re: Non committer collaborators on GitHub

2018-12-14 Thread Allen Wittenauer


> On Dec 14, 2018, at 9:21 AM, Joan Touzet  wrote:
> 
> Allen Wittenauer wrote:
>> I think part of the basic problem here is that Github’s view of permissions 
>> is really awful.  It is super super dumb that accounts have to have 
>> admin-level privileges for repos to use the API to do some basic things that 
>> can otherwise be gleaned by just scraping the user-facing website.  If 
>> anyone from Github is here, I’d love to have a chat. ;)
> 
> FYI I've previously been told we can't use addons to GitHub to improve
> the issue management workflow (like https://waffle.io/) precisely
> because GitHub's permissions model is so poor, allowing an external
> tool to move tickets around requires giving it effectively commit
> access, which is forbidden to third parties.

Putting my thinking cap on, I wonder if the workaround here is to have 
a proxy for the REST API that forwards the ’safe’ calls but disallows others. 
Maybe one already exists? I totally get the security and potentially legal 
ramifications of having accounts that can push.  But it sure seems like this 
problem is solvable with a bit of elbow grease.

Re: Non committer collaborators on GitHub

2018-12-14 Thread Allen Wittenauer


> On Dec 14, 2018, at 3:57 AM, Zoran Regvart  wrote:
> 
> Hi Builders,
> I see some projects like Apache Sling use their own GitHub accounts
> via personal access tokens on GitHub. I'm guessing this is a
> workaround for not having a non-committer collaborator account that
> can be used to update commit status from Jenkins pipelines.
> 
> I too have created an account, I needed one just to bypass the API
> limits for anonymous access[1]. But since that account is not a
> collaborator on GitHub it cannot update the commit status. I.e. the
> end result is:
> 
> Could not update commit status, please check if your scan credentials
> belong to a member of the organization or a collaborator of the
> repository and repo:status scope is selected
> 
> So one way of fixing this is to use my own GitHub account, which I'm,
> understandably hesitant to do.
> 
> Another is to have this non-committer account added as a collaborator,
> would this violate any ASF rules?
> 
> And, probably the best one, is to have a ASF wide GitHub account that
> builds can use.


More or less, +1 .

I’m currently going through this whole exercise now.

We committed support for Github Branch Source Plug-in (and Github pull request 
builder) into Apache Yetus and now want to test it.  But it’s pretty impossible 
to do that because the account that we’re using (that’s tied to 
priv...@yetus.apache.org) doesn’t have enough access permissions to really do 
much.

I do think because of how Github works, an ASF-wide one is probably too 
dangerous.  But I can’t see why private@project accounts couldn’t be added so 
long as folks don’t do dumb things like auto-push code.  There has to be a 
level of trust here unfortunately though which is why it may not come to 
fruition. :(

Side-rant:

I think part of the basic problem here is that Github’s view of permissions is 
really awful.  It is super super dumb that accounts have to have admin-level 
privileges for repos to use the API to do some basic things that can otherwise 
be gleaned by just scraping the user-facing website.  If anyone from Github is 
here, I’d love to have a chat. ;)





Re: Can we package release artifacts on builds.a.o?

2018-12-11 Thread Allen Wittenauer



> On Dec 11, 2018, at 9:43 AM, Joan Touzet  wrote:
> Thanks, Allen. So I am still fighting against the system here.

I view it more as tilting at windmills but tomato, tomato. ;)

> If binaries are conveniences, and they are not official, we should be able to 
> auto-push binaries built on trusted infrastructure out to the world. Why 
> can't that be our (Infra maintained & supported, costly from a non-profit 
> perspective) CI/CD infrastructure?

Frankly:  given how much dumb stuff I see happening on the ASF Jenkins 
servers on a regular basis, I know I wouldn’t trust them as far as I could 
throw them.  [I’m pretty sure those servers are heavy and I’m not very strong, 
so that wouldn’t be very far. :) ]  All it would take is one person firing off 
a ‘bad' build that then gets signed by a buildbot account and now ALL of the 
ASF builds signed by that account are suspect.  That would be super bad.

From a more philosophical perspective, the current model definitely 
stresses the idea that the ASF is made up of diverse communities that all have 
their own (relative) governance.  The binary artifacts I’ve done for Apache 
Yetus take a few minutes and look very different than binary artifacts from 
other projects. Meanwhile, people would scream bloody murder if the artifact 
build server were tied up for the ~2-3 hours it takes to make Apache Hadoop 
while it downloads fresh copies of the hundreds of Docker and Apache Maven 
dependencies required to build.  [Because, I mean, you _are_ building 
_everything_ from scratch when building these, right???]

Re: A little general purpose documentation on using Jenkinsfiles

2018-12-11 Thread Allen Wittenauer



> On Dec 11, 2018, at 6:44 AM, Christofer Dutz  
> wrote:
> 
> Hi all,
> 
> As I have been setting up the builds for several Apache projects, I usually 
> encountered the same problems and used the same solutions.
> 
> Thinking that this might help other projects, I documented recipes for some 
> of the typical problems I encountered over time:
> 
> https://cwiki.apache.org/confluence/display/INFRA/Multibranch+Pipeline+recipies
> 
> Feel free to comment, add or correct me … maybe things can be done even 
> better.

This is great!  I would have loved to have it when I was starting with this 
stuff!


That said:  I think the #1 thing when dealing with multibranch 
pipelines is that the post cleanup step should always have a deleteDir() in it to 
wipe out the workspace, since the workspace directories for multibranch 
pipelines are hashes.  Not doing that means that Jenkins slaves will fill up. :(
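
In other words, something like this at the end of every multibranch Jenkinsfile:

post {
    // 'cleanup' runs after all the other post conditions, so the hashed
    // workspace directory is removed no matter how the run ended.
    cleanup {
        deleteDir()
    }
}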

Re: Can we package release artifacts on builds.a.o?

2018-12-11 Thread Allen Wittenauer



> On Dec 11, 2018, at 9:09 AM, Joan Touzet  wrote:
> 
> Jenkins users are deploying directly to Nexus with builds.
> 
> Isn't that speaking out of both sides of our mouths at the same time, if Java 
> developers can push release builds directly to Nexus but non-Java developers 
> can't?
> 
> Perhaps I'm misunderstanding...are the Nexus-published builds not treated the 
> same because they're not on dist.apache.org? Or are they not release versions?

Yes, you are misunderstanding.

1) Officially (legally?), source code distributions are "the release."  
Any and all binaries are considered to be convenience binaries so users don’t 
have to compile.  They are not official.   [Statements like “verify a release 
by rebuilding” don’t really parse as a result.]  

2) As far as I’m aware/all the projects I’ve ever worked with, the 
uploads to Nexus are to the snapshot repo, not the release repo.  The release 
repos are still done manually. 



Re: Can we package release artifacts on builds.a.o?

2018-12-08 Thread Allen Wittenauer



> On Dec 7, 2018, at 11:56 PM, Alex Harui  wrote:
> 
> 
> 
> On 12/7/18, 10:49 PM, "Allen Wittenauer" 
>  wrote:
> 
> 
> 
>> On Dec 7, 2018, at 10:22 PM, Alex Harui  wrote:
>> 
>> Maven's release plugins commit and push to Git and upload to repository.a.o. 
>>  I saw that some folks have a node that can commit to the a.o website SVN.  
>> Is anyone already doing releases from builds?  What issues are there, if any?
> 
>   It's just flat out not secure enough to do a release on.
> 
> Can you give me an example of how it isn't secure enough?


The primary purpose of these servers is to run untested, unverified 
code.  

Jenkins has some very sharp security corners that make it trivially 
un-trustable.  Something easy to understand: when Jenkins is configured to run 
multiple builds on a node, all builds on that node run in the same user space. 
Because there is no separation between executors, it's very possible for anyone 
to execute something that modifies another running build.  For example, 
probably the biggest bang for the least amount of work would be to replace jars 
in the shared maven cache.

[... and no, Docker doesn't help.]

There are other, bigger problems, but I'd rather not put that out in 
the public.




Re: Can we package release artifacts on builds.a.o?

2018-12-07 Thread Allen Wittenauer



> On Dec 7, 2018, at 10:22 PM, Alex Harui  wrote:
> 
> Maven's release plugins commit and push to Git and upload to repository.a.o.  
> I saw that some folks have a node that can commit to the a.o website SVN.  Is 
> anyone already doing releases from builds?  What issues are there, if any?

It's just flat out not secure enough to do a release on.



YETUS-15 has been committed

2018-11-11 Thread Allen Wittenauer
[BCC: builds@apache.org since this may impact some folks there...]

Apologies for the wide distribution...  

YETUS-15 [https://issues.apache.org/jira/browse/YETUS-15] has been committed.  
This patch switches Apache Yetus over to a Maven-based build system.  This has 
a few different impacts:

* Any existing jobs that are tied specifically to master will almost 
certainly break.  I'm unsure if any of these still exist, with one exception ...

* precommit-admin (the job that submits JIRAs to the various 
precommit-* jobs) has already been fixed to use the new layout.  Patch 
submission via that method should continue uninterrupted.

* I'm currently working on getting the Yetus-specific jobs running 
again.  There are quite a few and testing the testing framework is always 
tricky. :)

* Any patches sitting in the YETUS patch tree will obviously need to be 
rebased.

* There are likely some rough spots in the new build and the new Maven 
plug-in for providing Yetus functionality.  Be sure to play with them, file 
JIRAs, and even better if patches come with it. :)

Thanks!


=== Current release note for YETUS-15 ===

Apache Yetus has been converted to use Apache Maven as a build tool. As a 
result, many changes have taken place that directly impacts the project. 

* Source directories have been re-arranged and re-named: 
  * All bash code is now in (feature)/src/main/shell 
  * All python code is now in (feature)/src/main/python 
* audience-annotations is mostly unchanged. 
* releasedocmaker and shelldocs are now available as Jython-built jars. 
* Introduction of the yetus-minimaven-plugin and yetus-maven-plugin. The 
yetus-minimaven-plugin is used to build Apache Yetus. yetus-maven-plugin is an 
end-user artifact that gives Apache Maven and compatible build systems access 
to some Apache Yetus features without needing any external help (e.g., 
yetus-wrapper) 
* Middleman is still used for creating the static website, however, it is now 
tied into the 'mvn site' command. 'mvn install' MUST be executed before running 
'mvn site' as website generation depends upon the yetus-minimaven-plugin. 
* The content of yetus-project is now in the root of the source tree. 
* The new yetus-dist module handles the creation of a complete distribution. 
The artifacts are now in the yetus-dist/target directory. The artifact contents 
are largely unchanged. New yetus-assemblies module and various Apache Maven 
configuration files have been added to create distribution parity. 
* The website is also available as a tar.gz tarball in the yetus-dist artifact 
area. 
* The jdiff module is now always built. 
* Version handling has been modified in several different locations and the 
executables themselves. 

Also, other changes introduced: 

* start-build-env.sh has been added to create a Docker-ized development 
environment. In particular, this imports the .ssh and .gnupg directories and 
has all pre-requisites for building Apache Yetus and making releases. 
* A Dockerfile in root has been added for hub.docker.com and CI-system 
integration. 
* The old Dockerfile (previously located at precommit/test-patch-docker and now 
located at precommit/src/main/shell/test-patch-docker) has been changed so that 
it can also run releasedocmaker. 
* Some ruby dependencies for the website have been updated for security reasons. 
* JDK8 is now the minimum version of Java used to build the Apache Yetus Java 
components. 
* precommit's shellcheck.sh now recognizes src/main/shell as containing shell 
code to check. 
* releasedocmaker and shelldocs now explicitly call for python2



Re: Jenkins Slave Workspace Retention

2018-07-23 Thread Allen Wittenauer


> On Jul 23, 2018, at 4:35 PM, Gav  wrote:
> 
> Thanks Allen,
> 
> Some of our nodes are only 364GB in total size, so you can see that this is
> an issue.

Ugh.

> For the H0-H12 nodes we are pretty fine currently with 2.4/2.6TB disks -
> therefore  the  urgency is on the Hadoop nodes H13 - H18 and the non Hadoop 
> nodes.
> 
> I propose therefore H0-H12 be trimmed on a monthly basis got mtime +31 in
> the workspace and the H13-H18 + the remaining nodes with 500GB disk and less 
> by done
> weekly
> 
> Sounds reasonable ?

Disclosure: I’m not really doing much with the Hadoop project anymore 
so someone from that community would need to step forward. 

But If I Were King:

For the small nodes in the Hadoop queue, I’d request they either get 
pulled out or put into ‘Hadoop-small’ or some other similar name.  Doing a 
quick pass over the directory structure via Jenkins, with only one or two 
outliers, everything there is ‘reasonable’, i.e., 400G drives are just 
under-spec’ed for the full workload that the ‘Hadoop’ nodes are expected to do 
these days. 7 days isn’t going to do it.  Putting JUST the nightly jobs on them 
(hadoop qbt, hbase nightly, maybe a handful of other jobs) would eat plenty of 
disk space.

7 days then the workspace dir goes away is probably reasonable based on 
the other nodes though.  But it looks to me like there are jobs running on the 
non-Hadoop nodes that probably should be in the Hadoop queue (Ambari, HBase, 
Ranger, Zookeeper, probably others). Vice-versa is probably also true.  It 
might also be worthwhile to bug some of the vendors involved to see if they can 
pony up some machines/cash for build server upgrades like Y!/Oath did/does.

That said, I potentially see some changes that the Apache Yetus project 
can do to lessen the disk space load for those projects that use it.  I’ll need 
to experiment a bit first to be sure.  Looking at 10s of G freed up if my 
hypotheses are correct. That might be enough to not move nodes around in the 
Hadoop queue but I can’t see that lasting long.

Jenkins allegedly has the ability to show compressed log files.  It 
might be worthwhile investigating doing something in this space on a global 
level.  Just gzip up every foo.log in workspace dirs after 24 hours or 
something.
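
As a very rough sketch of that kind of housekeeping (the label, the path, and the one-day cutoff are all just guesses at sane defaults):

// Hypothetical per-label housekeeping job, fired off a cron trigger.
node('ubuntu') {
    sh '''
      # compress any *.log over a day old under the node's workspace root;
      # the mtime check keeps this away from files a running build just wrote
      find /home/jenkins/jenkins-slave/workspace -name '*.log' -mtime +1 \
        -exec gzip -q {} + || true
    '''
}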

One other thing to keep in mind:  the modification time on a directory 
only changes if a direct child of that directory changes.  There are likely 
many jobs that have a directory structure such that the parent workspace 
directory time is not modified.  Any sort of purge job is going to need to be 
careful not to nuke a directory structure like this that is being used. :)

HTH.

Re: Jenkins Slave Workspace Retention

2018-07-23 Thread Allen Wittenauer


> On Jul 23, 2018, at 4:07 PM, Gav  wrote:
> 
> You are trading latency for disk space here.

For the builds I’m aware of, without a doubt.  But that’s not 
necessarily true for all jobs.  As Jason Kuster pointed out, in some cases one 
may be trading reliability for disk space. (But I guess that reliability 
depends upon the node. :) )

> Just how long are you proposing that workspaces be kept for  - considering
> that the non hadoop nodes are running out of disk every day and workspaces
> of projects are exceeding 300GB in size, that seems totally over the top in
> order to keep a local cache around to save a bit of time.

300GB spread across how many jobs though?  All of them? If 300 jobs are 
using 1G each, that sounds amazingly good given that just the git repo may 
eat that much space for super active projects.  If it’s a single workspace hitting 
those numbers, then yes, that’s problematic.

Have you tried talking to the owners of the bigger space hogs 
individually?  I’d be greatly surprised if the majority of people relying upon 
Jenkins actually read builds@.  They are likely unaware their stuff is breaking 
the universe.




Re: Jenkins Slave Workspace Retention

2018-07-23 Thread Allen Wittenauer


> On Jul 23, 2018, at 3:04 PM, Joan Touzet  wrote:
> 
> 
> This is why we switched to Docker for ASF Jenkins CI. By pre-building our
> Docker container images for CI, we take control over the build environment
> in a very proactive way, reducing Infra's investment to just keeping the
> build nodes up, running, and with sufficient disk space.

All of the projects I’ve been involved with have been using 
Docker-based builds for a few years now.  Experience there has shown that, to 
ease debugging (esp. since the Jenkins machines are so finicky), information 
from inside the container needs to be available after the container 
exits.  As a result, Apache Yetus (which is used to control the majority of 
builds for projects like Hadoop and HBase) will specifically mount key 
directories from the workspace inside the container so that they are readable 
after the build finishes.  Otherwise one spends a significant amount of time 
doing a lot of head scratching as to why stuff failed on the Jenkins build 
servers but not locally.

It’s also worth pointing out that “just use Docker” only works if one 
is building on Linux.  That isn’t an option on Windows.  This is why a ‘one 
size fits all’ policy for all jobs isn’t really going to work.  Performance on 
the Windows machines is pretty awful (I’m fairly certain it’s IO), so any time 
savings there is huge. (For comparison, the last time I looked a Hadoop Linux 
full build + full analysis: 12 hours, Windows full build + partial analysis: 19 
hours… 7 hours difference with stuff turned off!)

> It also means that, once a build is done, there is no mess on the Jenkins
> build node to clean up - just a regular `docker rm` or `docker rmi` is
> sufficient to restore disk space. Infra is already running these aggressively,
> since if a build hangs due to an unresponsive docker daemon or network
> failure, our post-run script to clean up after ourselves may never run.


Apache Yetus pretty much manages the docker repos for the ‘Hadoop’ 
queue machines since it runs so frequently.  It happily deletes stale images 
after a time as well as killing any stuck containers that are still running 
after a shorter period of time.  This way ‘docker build’ commands can benefit 
from cache re-use but still get forced to do full rebuilds after a time.  I 
enabled the docker-cleanup functionality as part of the precommit-admin job in 
January as well, so it’s been working alongside whatever extra docker bits the 
INFRA team has been using on the non-Hadoop nodes.  

> We don't put everything into saved artefacts either, but we have built a
> simple Apache CouchDB-based database to which we upload any artefacts we
> want to save for development purposes only.

… and where does this DB run? Also, it’s not so much about the finished 
artifacts as much as it is about the state of the workspace post-build. If no 
jars get built, then we want to know what happened.

> We had this issue too - which is why we build under a `/tmp` directory
> inside the Docker container to avoid one build trashing another build's
> workspace directory via the multi-node sync mechanism.

Apache Yetus based builds mount a dir inside the container.  It’s 
relatively expensive to rebuild the repo for large projects. For Hadoop, this 
takes in the 5-10 minute area.  That may not seem like a lot. But given the 
number of build jobs per day, that adds up very quickly.  The quicker the big 
jobs run, the more cycles available for everyone and the faster contributors 
get feedback on their patches.  [Ofc, 




Re: Jenkins Slave Workspace Retention

2018-07-23 Thread Allen Wittenauer


> On Jul 23, 2018, at 10:33 AM, Jason Kuster  
> wrote:
> 
> +1, also occasionally there are network flakes downloading dependencies and
> when we were using Maven we were unable to find a way to get it to retry
> dependency downloads so this would routinely fail the build.

Great point. I completely forgot about how often the network falls out 
from underneath the build hosts.

Re: Jenkins Slave Workspace Retention

2018-07-23 Thread Allen Wittenauer


> On Jul 23, 2018, at 12:45 AM, Gavin McDonald  wrote:
> 
> Is there any reason at all to keep the 'workspace' dirs of builds on the
> jenkins slaves ?

Yes.  

- Some jobs download and build external dependencies, using the 
workspace directories as a cache and to avoid sending more work to INFRA.  
Removing the cache may greatly increase build time, network bandwidth, and 
potentially increase INFRA’s workload. 

- This will GREATLY increase pressure on the source 
repositories, as every job will now do a full git clone/svn checkout.  Hadoop’s 
repo size just passed 700M.

- Many jobs don’t put everything into the saved artifacts due to size 
constraints.  Removing the workspace will almost certainly guarantee that 
artifact usage goes way up as the need to grab (or cache) bits from the 
workspace will be impossible with an overly aggressive workspace deletion 
policy.

Given how slow IO is on the Windows build hosts, this list is 
especially critical on them.

> And , in advance, I'd like to state that projects creating their own
> storage area for jars and other artifacts to quicken up their builds is not a 
> valid reason.

Maven, ant, etc don’t perform directory locks on local repositories.  
Separate storage areas for jars are key so that multiple executors don’t step 
all over each other.  This was a HUGE problem for a lot of jobs when multiple 
executors were introduced a few years ago.
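
The usual trick is to point the build at a repository inside the workspace so every executor gets its own copy, e.g. (a sketch):

stage('build') {
    steps {
        // every executor/workspace gets its own local repo instead of
        // fighting over the shared ~/.m2/repository
        sh 'mvn -B -Dmaven.repo.local="${WORKSPACE}/.m2repo" clean install'
    }
}

The cost is disk space and re-downloading dependencies per workspace, which circles right back to the caching point above.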

Re: Pb releasing apache directory LDAP API

2018-04-30 Thread Allen Wittenauer

> On Apr 30, 2018, at 9:02 AM, Emmanuel Lécharny  wrote:
> 
> Missing Signature:
> '/org/apache/directory/api/api-asn1-api/1.0.1/api-asn1-api-1.0.1.jar.asc'
> does not exist for 'api-asn1-api-1.0.1.jar'.
> ...
> 
> 
> The .md5 and .sha1 signatures are present though.

asc files are PGP signature files.

> What could have gone wrong ?

Are the directions missing -Papache-release or gpg:sign?



Re: purging of old job artifacts

2018-04-25 Thread Allen Wittenauer

> On Apr 25, 2018, at 12:04 AM, Chris Lambertus  wrote:
> 
> The artifacts do not need to be kept in perpetuity. When every project does 
> this, there are significant costs in both disk space and performance. Our 
> policy has been 30 days or 10 jobs retention. 

That policy wasn’t always in place.

> Please dispense with the passive aggressive “unwilling to provide” nonsense. 
> This is inflammatory and anti-Infra for no valid reason. This process is 
> meant to be a pragmatic approach to cleaning up and improving a service used 
> by a large number of projects. The fact that I didn’t have time to post the 
> job list in the 4 hours since my last reply does not need to be construed as 
> reticence on Infra’s part to provide it.

I apologize.  I took Greg’s reply as Infra’s official “Go Pound Sand” 
response to what I felt was a reasonable request for more information.


> Using the yetus jobs as a reference, yetus-java builds 480 and 481 are nearly 
> a year old, but only contain a few kilobytes of data. While removing them 
> saves no space, they also provide no value,

… to infra.

The value to the communities that any job services is really up to 
those communities to decide.  

Thank you for providing the data.  Now the projects can determine what 
they need to save and perhaps change process/procedures before infra wipes it 
out.





Re: purging of old job artifacts

2018-04-24 Thread Allen Wittenauer

> On Apr 24, 2018, at 4:27 PM, Chris Lambertus  wrote:
> 
> The initial artifact list is over 3 million lines long and 590MB.

Yikes. OK.  How big is the list of jobs?  [IIRC, that should be the 
second part of the file path. e.g., test-ulimit ]  That’d give us some sort of 
scope, who is actually impacted, and hopefully allow everyone to clean up their 
stuff. :)


Thanks

Re: purging of old job artifacts

2018-04-24 Thread Allen Wittenauer

> On Apr 24, 2018, at 4:13 PM, Chris Lambertus  wrote:
> 
> If anyone has concerns over this course of action, please reply here.

Could we get a list?

Thanks!



Re: Building native (C/C++) code

2018-03-20 Thread Allen Wittenauer



On Mon, Mar 19, 2018 at 1:16 AM, Jan Lahoda  wrote:

> Hi,
> 
> In Apache NetBeans (incubating), we have some smallish native (C/C++)
> components. I'd like to ask: are there any official build servers that can
> build C/C++ code for Windows? (Does not need to be Windows, IMO, could even
> be cross-compilation.) And if yes, are there existing projects that are
> using them, so that we could look how to do things properly?
using them, so that we could look how to do things properly?



FWIW, there are 4 Windows slaves on the Jenkins instance at 
builds.apache.org if you're more familiar with that type of environment. 
All four have working versions of VS2015 Pro, as we use them for 
testing Apache Hadoop.





Switched PreCommit-Admin over to Apache Yetus

2018-01-16 Thread Allen Wittenauer
bcc: builds@apache.org, d...@hbase.apache.org, d...@hive.apache.org, 
common-...@hadoop.apache.org, d...@phoenix.apache.org, d...@oozie.apache.org

(These are all of the projects that had pending issues.  This is not 
all of the groups that actually rely upon this code...)

The recent JIRA upgrade broke PreCommit-Admin.  

This breakage was obviously an unintended consequence.  This python 
code hasn’t been touched in a very long time.  In fact, it was still coming 
from Hadoop SVN tree as it had never been migrated to git.  So it isn’t too 
surprising that after all this time something finally broke it.

Luckily, Apache Yetus was already in the process of adopting it for a 
variety of reasons that I won't go into here.  With the breakage, this work 
naturally became more urgent.  With the help of the Apache Yetus community 
doing some quick reviews, I just switched PreCommit-Admin over to using the 
master version of the equivalent code in the Yetus source tree.  As soon as the 
community can get a 0.7.0 release out, we’ll switch it over from master to 
0.7.0 so that it can follow our regular release cadence.  This also means that 
JIRA issues can be filed against Yetus for bugs seen in the code base or for 
feature requests.  [Hopefully with code and docs attached. :) ]

In any case, with the re-activation of this job, all unprocessed jobs 
just kicked off.  So don't be too surprised by the influx of feedback.

As a sidenote, there are some other sticky issues with regards to 
precommit setups on Jenkins.  I'll be sending another note in the future on 
that though. I've had enough excitement for today. :)

Thanks!



Re: [JENKINS] - Main instance and plugin upgrades this weekend

2017-12-22 Thread Allen Wittenauer

> On Dec 20, 2017, at 2:18 PM, Gavin McDonald  wrote:
> 
> Hi All,
> 
> Jenkins will be going down for a few hours this coming Saturday/Sunday for a 
> main instance upgrade to the latest LTS release.
> 
> As part of that, all compatible plugins will be upgraded before and/or after 
> as appropriate.

Is it possible to fix INFRA-15685 while the services will be down?  

Thanks.

Re: New Windows Jenkins Node

2017-11-30 Thread Allen Wittenauer

> On Nov 30, 2017, at 12:54 PM, Chris Thistlethwaite  wrote:
> 
> Greetings,
> 
> Good news everyone! We've been working on puppetizing Windows Jenkins 
> nodes and have a new build VM that could use some testing and burn-in. 
> I'm looking for volunteers to point their build to jenkins-win2016-1 to 
> iron out any issues. 

Sure, I’ll set the daily Hadoop Windows build (hadoop-trunk-win) to run 
on it to help destroy... err, I mean burn it in. ;)

Re: Building with docker - Best practices

2017-11-14 Thread Allen Wittenauer

> On Nov 14, 2017, at 9:17 AM, Thomas Bouron  wrote:
> 
> 1. In Jenkins, rather than using a maven type job, I'm using a freestyle type 
> job to call `docker run  ` during the build phase. Is it the right way to 
> go?

All of the projects I’m involved with only ever use freestyle jobs.  
S… :)

> 2. My docker images are based on `maven:alpine` with few extra bits and bobs 
> on top. All is working fine but, how do I configure jenkins to push built 
> artifacts (SNAPSHOT) on Apache maven repo? I'm sure other projects do that 
> but couldn't figure it out so far.

This is one of the few jobs that Apache Hadoop doesn’t have dockerized. 
 I think I know what needs to happen (import the global maven settings) but I 
just haven’t gotten around to building the bits around it yet.  I’ll probably 
write something up and add it to the Apache Yetus toolbox.  
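
A sketch of what I think that looks like (the fileId is a made-up placeholder, and it assumes the Config File Provider plugin is what ends up holding the shared Maven settings):

node('ubuntu') {
    checkout scm
    // pull the managed settings.xml (with the snapshot repo credentials)
    // into the workspace, then hand it to the containerized build.
    // 'asf-global-maven-settings' is a placeholder fileId.
    configFileProvider([configFile(fileId: 'asf-global-maven-settings',
                                   variable: 'MAVEN_SETTINGS')]) {
        docker.image('maven:alpine').inside {
            sh 'mvn -B -s "$MAVEN_SETTINGS" clean deploy'
        }
    }
}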

> 3. Each git submodule requiring a custom docker image will have their own 
> `Dockerfile` at the root. I was planning to create an extra jenkins job to 
> build and publish those images to docker hub. Does Apache has an official 
> account and if yes, should we use that? Otherwise, I'll create an account for 
> our project only and share the credential with our PMCs.

I’m personally not a fan of depending upon docker hub for images. I’d 
rather build the images as part of the QA pipeline to verify they always work, 
and if the versions of bits aren’t pinned, to test against the latest. This 
also allows the Dockerfile to get precommit testing.
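
i.e., roughly this shape, so a busted Dockerfile fails the precommit run instead of surfacing later (the image name and path are placeholders):

node('ubuntu') {
    checkout scm
    // build the image straight from the Dockerfile in the change itself;
    // 'path/to/module' is wherever the Dockerfile lives in the repo
    def img = docker.build("project-build-env:${env.BUILD_NUMBER}", 'path/to/module')
    img.inside {
        sh 'mvn -B clean verify'
    }
}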

It’s worth mentioning that all of the projects I’m involved with use 
Yetus to automate a lot of this stuff.  Patch testing uses the same base images 
as full builds. So if your tests run frequently enough, they’ll stay cached and 
the build time becomes negligible over the course of the week.  

As I work on getting the Yetus jenkins plug-in written, this will 
hopefully be dirt simple for everyone to do without spending any time really 
learning Yetus.


Jenkins needs to get kicked again.

2017-10-31 Thread Allen Wittenauer

It’s no longer scheduling jobs.

Any clues as to why this happens?




Re: Jenkins slave able to build BED & RPM

2017-10-30 Thread Allen Wittenauer

> On Oct 30, 2017, at 4:33 AM, Dominik Psenner  wrote:
> 
> 
> On 2017-10-30 11:57, Thomas Bouron wrote:
>> Thanks for the reply and links. Went already to [1] but it wasn't clear to 
>> me what distro each node was (unless going through every one of them but... 
>> there are a lot) As you said, it seems there isn't a centos or Red Hat 
>> slave, I'll file a request to INFRA for this then.
> 
> You also have the option to run the build with docker on ubuntu using a 
> centos docker image. I think it would be wise to evaluate that option before 
> filing a request to INFRA. The great benefit is that you can build an rpm and 
> test a built rpm on all the rhel flavored docker images that you would like 
> to support without the requirement to add additional operating systems or 
> hardware to the zoo of build slaves.

+1

Despite the issues[*], I’m looking forward to a day when INFRA brings 
the hammer down and requires everyone to use Docker on the Linux machines.  
I’ve spent the past week looking at why the Jenkins bits have become so 
unstable on the ‘Hadoop’ nodes.  One thing that is obvious is that the jobs 
running in containers are way easier to manage from the outside.  They don’t 
leave processes hanging about and provides enough hooks to make sure jobs are 
getting a ‘fair share’ of the node’s resources. Bad actor? Kill the entire 
container. Bam, gone. That’s before even removing the need to ask for software 
to be installed. [No need for 900 different versions of Java installed if 
everyone manages their own…]

* - mainly, disk space management and docker-compose creating a complete mess 
of things. 

builds.apache.org broken again?

2017-10-23 Thread Allen Wittenauer

Is it just me or has Jenkins stopped scheduling jobs again?

(We also have a bunch of broken nodes in the Hadoop label.)

Thanks.


Re: Problems with Jenkins UI.

2017-10-20 Thread Allen Wittenauer

> On Oct 19, 2017, at 7:35 PM, Alex Harui  wrote:
> 
> Hi,
> 
> I'm finding that the UI for Jenkins is horribly slow today.

I think it’s been slow for a while.  There are good days and there are bad 
days (like today, where it is bordering on unusable).  I’ve been doing trace 
routes from home to builds.apache.org on a fairly regular basis.  The biggest 
constant is this massive time delay inside telia.net. [ I guess at some point 
the host moved to Europe?]

e.g.:

$ traceroute builds.apache.org
traceroute to builds.apache.org (62.210.60.235), 64 hops max, 52 byte packets
 1  turris (192.168.100.1)  0.901 ms  0.894 ms  0.671 ms
 2  192.168.1.254 (192.168.1.254)  1.334 ms  2.572 ms  1.171 ms
 3  108-193-0-1.lightspeed.sntcca.sbcglobal.net (108.193.0.1)  22.929 ms  
20.937 ms  36.287 ms
 4  71.148.135.198 (71.148.135.198)  19.569 ms  18.899 ms  19.737 ms
 5  71.145.0.244 (71.145.0.244)  19.547 ms  19.309 ms  20.974 ms
 6  12.83.39.145 (12.83.39.145)  22.587 ms
12.83.39.137 (12.83.39.137)  19.878 ms  21.139 ms
 7  gar23.sffca.ip.att.net (12.122.114.5)  23.315 ms  21.888 ms  23.886 ms
 8  192.205.32.222 (192.205.32.222)  21.677 ms  23.198 ms  21.953 ms
 9  nyk-bb4-link.telia.net (62.115.119.228)  98.108 ms *
ash-bb3-link.telia.net (80.91.252.221)  88.178 ms
10  prs-bb4-link.telia.net (62.115.135.117)  164.764 ms
prs-bb4-link.telia.net (80.91.251.101)  164.284 ms
prs-bb3-link.telia.net (213.155.135.4)  184.142 ms
11  prs-b8-link.telia.net (62.115.118.73)  183.718 ms
prs-b8-link.telia.net (62.115.136.177)  171.897 ms  170.859 ms
12  online-ic-315748-prs-b8.c.telia.net (62.115.63.94)  174.738 ms  171.917 ms  
164.820 ms
13  195.154.1.229 (195.154.1.229)  162.056 ms  177.235 ms  174.669 ms
14  62.210.60.235 (62.210.60.235)  161.606 ms  168.438 ms  173.439 ms








Re: Proactive Jenkins slaves monitoring?

2017-10-12 Thread Allen Wittenauer

> On Oct 12, 2017, at 8:34 AM, Robert Munteanu  wrote:
> Jenkins slaves running out of disk space has been an issue for quite
> some time. Not a major deal-breaker or very frequent, but it's still
> annoying to chase issues, reconfigure slave labels, retrigger builds,
> etc


From what I’ve seen, the biggest issues are caused by broken docker 
jobs.  I don’t think people realize that when their docker jobs fail, the disk 
space and containers aren’t released. (Docker only automatically cleans up on 
*success*!)  Apache Yetus has tools to deal with old docker bits on the system. 
As a result, on the ‘hadoop’-labeled machines (which have multiple projects 
using Yetus precommit in sentinel mode), I don’t think I’ve seen an 
out-of-space failure on those nodes in a very long time.
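
For jobs that aren’t running under Yetus, the manual equivalent is roughly the 
following (a simplification; the Yetus code also chases down long-running 
containers and ages out old images):

  $ docker ps -aq --filter status=exited | xargs -r docker rm
  $ docker images -q --filter dangling=true | xargs -r docker rmi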

Apache Yetus itself is configured to run on quite a few nodes.  When 
the (rare) patch comes through that runs on a node that isn’t typically running 
Yetus, it isn’t unusual to see months’ worth of images eating space and 
containers still running.  It will then wipe out a bunch of the excess.  I 
should probably add df (and CPU time?) output to see how much it is reclaiming. 
In some cases I’ve seen, it’s easily in the high-GB range.
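
For the measuring part, something as simple as running these before and after 
the cleanup pass would probably do (docker system df needs a reasonably recent 
Docker):

  $ df -h /var/lib/docker
  $ docker system df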



Re: H18 full

2017-09-22 Thread Allen Wittenauer

> On Sep 22, 2017, at 6:35 AM, Allen Wittenauer <a...@effectivemachines.com> 
> wrote:
> 
> 
>> On Sep 22, 2017, at 6:24 AM, Daniel Pono Takamori <p...@apache.org> wrote:
>> 
>> Allen, do you have a link to a job that failed in the way you
>> describe?  I booped the docker service to be safe, so hopefully it was
>> temporal.
> 
> 
>   Most of Hadoop’s jobs that were running on H10 use docker build.  I’ll 
> reconfigure precommit-hadoop-build to only run on H10 and fire off a job or 
> three.
> 
>   Thanks!


The two jobs I fired off got past where they were hanging, so it looks 
like H10 is good again.

Thanks!

Re: Builds with maven 3.x on Jenkins

2017-09-15 Thread Allen Wittenauer

> On Sep 15, 2017, at 2:36 PM, Oleg Kalnichevski  wrote:
> ERROR: Maven Home \home\jenkins\tools\maven\apache-maven-3.0.5 doesn’t exist

That’s the Windows box.

> Is there anything I could be doing wrong? 

Did you mean to run on the Windows box?

;D




Re: Precommit-Admin no longer running on Jenkins

2017-08-05 Thread Allen Wittenauer

> On Aug 4, 2017, at 7:16 AM, Allen Wittenauer <a...@effectivemachines.com> 
> wrote:
> 
> 
>> On Aug 4, 2017, at 7:13 AM, Allen Wittenauer <a...@effectivemachines.com> 
>> wrote:
>> There’s definitely something wrong with scheduling on Jenkins but I’m 
>> clueless as to what it is. 
> 
> 
> Add yetus-qbt to the list of jobs not getting scheduled.  OK, it’s definitely 
> not just Hadoop then.

For those curious, scheduled jobs that should be running still aren’t.

I’ve filed INFRA-14798 as a result.

Re: Precommit-Admin no longer running on Jenkins

2017-08-04 Thread Allen Wittenauer

> On Aug 4, 2017, at 7:13 AM, Allen Wittenauer <a...@effectivemachines.com> 
> wrote:
>  There’s definitely something wrong with scheduling on Jenkins but I’m 
> clueless as to what it is. 


Add yetus-qbt to the list of jobs not getting scheduled.  OK, it’s definitely 
not just Hadoop then.

Re: Precommit-Admin no longer running on Jenkins

2017-08-04 Thread Allen Wittenauer

> On Aug 4, 2017, at 7:07 AM, Sean Busbey  wrote:
> 
> Oh wait, you meant H for hashing in the crontab. lol.
> 
> /facepalm
> 
> I'll go change that too. :)

:D

It’s early Friday morning.  You’re allowed a few face palms. 

FWIW: I (manually) kicked off Hadoop’s big jobs.  Just like this one, 
they fired right up with no problems.  There’s definitely something wrong with 
scheduling on Jenkins but I’m clueless as to what it is.  I added a Poll SCM 
w/ an H timer to hadoop-trunk-win to see if it fires up tonight.  I noticed that 
a few other jobs that appear to be getting scheduled have that.
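
For anyone not familiar with the H syntax: the only difference is in the 
schedule field of the trigger, and H spreads jobs out instead of having them 
all stampede at the same minute. Something like:

  H 4 * * *    # once a day during the 04:xx hour, at a minute hashed from the job name
  0 4 * * *    # every job configured this way fires at exactly 04:00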




Re: Precommit-Admin no longer running on Jenkins

2017-08-04 Thread Allen Wittenauer

> On Aug 3, 2017, at 3:36 PM, Gavin McDonald  wrote:
> 
> Note that just means the Hadoop nodes, seeing as there is no ‘Ubuntu’ label 
> any more, its ‘ubuntu’


Oh, actually, I typo’d that.  It’s using lowercase ubuntu. :)

But it’s definitely more than just precommit-admin that is not getting 
scheduled. hadoop-trunk-win and 
hadoop-qbt-trunk-java8-linux-x86 didn’t run, amongst others.  It’s 
starting to look like any job that uses H isn’t getting scheduled.

Re: Precommit-Admin no longer running on Jenkins

2017-08-03 Thread Allen Wittenauer

> On Aug 3, 2017, at 12:21 PM, Sean Busbey  wrote:
> 
> What are the associated node labels for it running? That's the most
> common cause of no runs I know of.


It’s set for Hadoop||Ubuntu, so there are definitely nodes.

I just disabled Poll SCM and left the timer.  I figure it’s worth a shot.  
Maybe the config update will do something if nothing else.

Timezone change?

2017-07-21 Thread Allen Wittenauer

Did the timezone on the box change or something else regarding time?  
I’ve noticed that one of our scheduled builds is now starting ~8-9 hours later. 
 No big deal (other than it eating a slot during prime time), but just 
wondering if we should adjust all scheduled jobs.

Thanks.

Re: Tons of nodes offline/broken

2017-06-23 Thread Allen Wittenauer

> On Jun 23, 2017, at 10:03 AM, Daniel Pono Takamori  wrote:
> 
> Looks like this is relatively adaptable to deploying for other builds,
> if not generalizable to use in a cron.  Thanks a bunch!


You're welcome.  For those that have never seen Apache Yetus activate 
its cleanup mode, here's a sample:


https://builds.apache.org/job/PreCommit-YETUS-Build/589/consoleFull



Re: Tons of nodes offline/broken

2017-06-23 Thread Allen Wittenauer

> On Jun 23, 2017, at 7:11 AM, Allen Wittenauer <a...@effectivemachines.com> 
> wrote:
> 
> 
>> On Jun 22, 2017, at 11:12 PM, Daniel Pono Takamori <p...@apache.org> wrote:
>> 
>> Docker filled up /var/ so I cleared out the old images.  Going to work
>> on making sure docker isn't a disk hog in the future.
> 
> 
>   ha. Kind of ironic that my Yetus job failed... it would have cleaned it 
> up.  I guess we should probably make a way to run Yetus' docker cleanup code 
> independently. We could then schedule a job to run every X days to do the 
> cleanup automatically.


Filed YETUS-523 with a patch to add a 'docker-cleanup' command that 
just triggers Yetus' Docker cleanup code.
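
Once that lands, the periodic job should boil down to something like the line 
below (I'm assuming the new command will grow a --sentinel style flag to match 
the sentinel mode on the precommit side):

  $ docker-cleanup --sentinel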

Re: Tons of nodes offline/broken

2017-06-23 Thread Allen Wittenauer

> On Jun 22, 2017, at 11:12 PM, Daniel Pono Takamori  wrote:
> 
> Docker filled up /var/ so I cleared out the old images.  Going to work
> on making sure docker isn't a disk hog in the future.


ha. Kind of ironic that my Yetus job failed... it would have cleaned it 
up.  I guess we should probably make a way to run Yetus' docker cleanup code 
independently. We could then schedule a job to run every X days to do the 
cleanup automatically.

Tons of nodes offline/broken

2017-06-22 Thread Allen Wittenauer

Hi all.

Just noticed that there are a ton of H nodes offline and qnode3 is 
failing builds because it is out of space.

Thanks.
