To close this thread: the disconnection of the Linux agents was a side effect of kernel settings that used a TCP timeout that was too long.
Gavin applied the settings recommended in this doc ( https://support.cloudbees.com/hc/en-us/articles/115001369667-Dedicated-SSH-agent-gets-disconnected ) and it solved the issue:
sysctl -w net.ipv4.tcp_keepalive_time=120
sysctl -w net.ipv4.tcp_keepalive_intvl=30
sysctl -w net.ipv4.tcp_keepalive_probes=8
sysctl -w net.ipv4.tcp_fin_timeout=30

Thanks a lot Gavin for your help

On Tue, Jul 27, 2021 at 10:48 AM Arnaud Héritier <[email protected]> wrote:

> 👍 thanks
> As discussed on Slack I will open.a support case on CloudBees side to
> study the instability issue of linux agents.
> I will verify that there is nothing wrong in the new setup (but I found
> nothing bad personally which could create such issue)
> The major change when we compare ci-builds and ci-maven environments is
> that our agents are now running on Azure and sadly I heard about similar
> issues in others communities using Azure like Jenkins
> I will check if we can find a solution or if we can give enough details to
> Gavin to open a case on Azure side too.
> If we really don't find any solution with Azure we'll see with Gavin to
> deploy our agents somewhere else (but let's try to give a chance to Azure
> first)
>
> Cheers
>
> On Tue, Jul 27, 2021 at 10:37 AM Gavin McDonald <[email protected]>
> wrote:
>
>> On Tue, Jul 27, 2021 at 10:18 AM Arnaud Héritier <[email protected]>
>> wrote:
>>
>> > Gavin, these JDKs are only for build agents, right ?
>> > Tibor was asking for the JVM used to host Tomcat/Jenkins.
>> > (And I suppose that the controller part is templatised)
>> >
>>
>> Oh right, sorry, yes the client controllers use a system openjdk 8
>>
>> https://github.com/apache/infrastructure-p6/blob/production/modules/jenkins_client_master/manifests/init.pp
>>
>> > On Mon, Jul 26, 2021 at 1:06 PM Gavin McDonald <[email protected]>
>> > wrote:
>> >
>> >> There are MANY JDKS installed already. Oracle JDKs OpenJDKs AdoptOpenJDKs
>> >> - they are all already there.
>> >>
>> >> https://cwiki.apache.org/confluence/display/INFRA/JDK+Installation+Matrix
>> >>
>> >> On Mon, Jul 26, 2021 at 11:47 AM Arnaud Héritier <[email protected]>
>> >> wrote:
>> >>
>> >> > It has to be discussed with infra.
>> >> > I am not sure which distro is used.
>> >> > It's a Private Build of openJDK 8 (today CloudBees CI doesn't support
>> >> > something else than Java 8 - no comment)
>> >> > I don't have the feeling for now (with the data I reviewed) that it's a
>> >> > memory / GC issue (it could create disconnections under high load)
>> >> > Here the stability issue occurs even when the controller does nothing (the
>> >> > controller cannot ping the agent or vice and versa) and it seems to impact
>> >> > more the linux agents than the windows ones (it's a pity)
>> >> >
>> >> > On Fri, Jul 23, 2021 at 12:06 AM Tibor Digana <[email protected]>
>> >> > wrote:
>> >> >
>> >> >> Can you install AdoptOpenJdk for the Jenkins controller?
>> >> >> It contains Eclipse OpenJ9 Garbage Collector and it significantly decreases
>> >> >> memory consumption of the application due to the meta space goes to the
>> >> >> disk.
>> >> >> You should save 40 - 75% out of 3GB.
>> >> >> I used G1, Shenandoah, ZGC and Eclipse OpenJ9 which saved the most memory.
>> >> >>
>> >> >> On Thu, Jul 22, 2021 at 9:23 AM Arnaud Héritier <[email protected]>
>> >> >> wrote:
>> >> >>
>> >> >> > yes for the controller it depends of its size (number of jobs and types of
>> >> >> > jobs) but here we are fine it seems with our 3Gb
>> >> >> >
>> >> >> > * Java
>> >> >> >   - Version: 1.8.0_292
>> >> >> >   - Maximum memory: 3.00 GB (3221225472)
>> >> >> >   - Allocated memory: 3.00 GB (3221225472)
>> >> >> >   - Free memory: 750.15 MB (786591664)
>> >> >> >   - In-use memory: 2.27 GB (2434633808)
>> >> >> >   - GC strategy: G1
>> >> >> >   - Available CPUs: 2
>> >> >> >
>> >> >> > For agents I reduced the memory allocated to the agent process but it
>> >> >> > doesn't help much (it seems - even if it is still a good thing to do)
>> >> >> >
>> >> >> > What is strange is that I see our agents sometimes disconnected even when
>> >> >> > we have no activity on the jenkins controller
>> >> >> >
>> >> >> > Sadly jenkins is deployed on Apache Tomcat thus I cannot get access to its
>> >> >> > logs
>> >> >> >
>> >> >> > In general the connection lost is detected by what we call the PingThread (
>> >> >> > https://www.jenkins.io/doc/book/system-administration/monitoring/#ping-thread
>> >> >> > ) but not only
>> >> >> >
>> >> >> > https://ci-maven.apache.org/log/all
>> >> >> >
>> >> >> > For example it was few minutes ago we got 3 agents disconnected while
>> >> >> > nothing was running
>> >> >> >
>> >> >> > 2021-07-22 06:58:21.769+0000 [id=106291] INFO hudson.slaves.ChannelPinger$1#onDead:
>> >> >> > Ping failed. Terminating the channel maven4.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861769 hasn't completed by 1626937101769
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> > 2021-07-22 06:58:21.778+0000 [id=106292] INFO hudson.slaves.ChannelPinger$1#onDead:
>> >> >> > Ping failed. Terminating the channel maven3.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861777 hasn't completed by 1626937101778
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> > 2021-07-22 06:58:21.983+0000 [id=106295] INFO hudson.slaves.ChannelPinger$1#onDead:
>> >> >> > Ping failed. Terminating the channel maven5.
>> >> >> > java.util.concurrent.TimeoutException: Ping started at 1626936861982 hasn't completed by 1626937101983
>> >> >> >     at hudson.remoting.PingThread.ping(PingThread.java:134)
>> >> >> >     at hudson.remoting.PingThread.run(PingThread.java:90)
>> >> >> >
>> >> >> > @Gavin McDonald <[email protected]> In terms of network, is it the same
>> >> >> > environment we use today compared to the ci-builds.apache.org environment
>> >> >> > ?
>> >> >> >
>> >> >> > On Wed, Jul 21, 2021 at 11:48 PM Tibor Digana <[email protected]>
>> >> >> > wrote:
>> >> >> >
>> >> >> > > In my company, I also used 1GB for Xmx of Java Heap for the Jenkins JVM and
>> >> >> > > it was enough.
>> >> >> > > The subprocesses like Maven need to have much more memory to allocate for
>> >> >> > > themself rather than Jenkins JVM.
>> >> >> > > T
>> >> >> > >
>> >> >> > > On Wed, Jul 21, 2021 at 6:38 PM Arnaud Héritier <[email protected]>
>> >> >> > > wrote:
>> >> >> > >
>> >> >> > > > I am looking at our builds and I try to understand why our agents are often
>> >> >> > > > disconnected during the builds.
>> >> >> > > > We have in general a stacktrace like
>> >> >> > > >
>> >> >> > > > maven6 was marked offline: Connection was broken: java.io.IOException:
>> >> >> > > > Pipe closed after 0 cycles
>> >> >> > > >     at org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:118)
>> >> >> > > >     at org.apache.sshd.common.channel.ChannelPipedInputStream.read(ChannelPipedInputStream.java:101)
>> >> >> > > >     at hudson.remoting.FlightRecorderInputStream.read(FlightRecorderInputStream.java:92)
>> >> >> > > >     at hudson.remoting.ChunkedInputStream.readHeader(ChunkedInputStream.java:73)
>> >> >> > > >     at hudson.remoting.ChunkedInputStream.readUntilBreak(ChunkedInputStream.java:103)
>> >> >> > > >     at hudson.remoting.ChunkedCommandTransport.readBlock(ChunkedCommandTransport.java:39)
>> >> >> > > >     at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:34)
>> >> >> > > >     at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
>> >> >> > > >
>> >> >> > > > As far I can see we are using 16Gb "hosts" for linux agents
>> >> >> > > >
>> >> >> > > > Something very strange is that the jenkins agent (this small component
>> >> >> > > > doing the link between the build host and the controller) is configured
>> >> >> > > > with `-Xms8g -Xmx8g` thus we are reserving for it 50% of the server mem
>> >> >> > > > (even more because of the non-heap)
>> >> >> > > > This one in general should require in general really less. 1Gb is already a
>> >> >> > > > lot from my exp.
>> >> >> > > > Due to this, the OS can see it has the biggest process on the host and
>> >> >> > > > decide to kill it when the rest of the memory is used by the build.
>> >> >> > > > I think we should decrease this value.
>> >> >> > > > (I can do it but I don't know how was configured the ci.apache.org agents
>> >> >> > > > and I would like to not add more issue if this setting was here in the past
>> >> >> > > >
>> >> >> > > > I don't think it is the root cause of our instabilities (at least all) and
>> >> >> > > > there is something else I have to find but it's a cheap fix to try
>> >> >> > > >
>> >> >> > > > FYI our agents VMs are ~like this today:
>> >> >> > > >
>> >> >> > > > - Java
>> >> >> > > >   + Home: `/usr/local/asfpackages/java/oraclejdk-1.8.0-291/jre`
>> >> >> > > >   + Vendor: Oracle Corporation
>> >> >> > > >   + Version: 1.8.0_291
>> >> >> > > >   + Maximum memory: 7.67 GB (8232370176)
>> >> >> > > >   + Allocated memory: 7.67 GB (8232370176)
>> >> >> > > >   + Free memory: 6.03 GB (6470953760)
>> >> >> > > >   + In-use memory: 1.64 GB (1761416416)
>> >> >> > > >   + GC strategy: ParallelGC
>> >> >> > > >   + Available CPUs: 4
>> >> >> > > >
>> >> >> > > > 8Gb is reserved, 1Gb is used (because the GC does nothing as the Free mem
>> >> >> > > > is high)
>> >> >> > > >
>> >> >> > > > I would be in favor to try to launch them with -Xms128m
>> >> >> > > > -Xmx1g -XX:+UseG1GC -XX:+UseStringDeduplication
>> >> >> > > >
>> >> >> > > > I think it's enough customization to start with
>> >> >> > > >
>> >> >> > > > Cheers
>> >> >> > > >
>> >> >> > > > On Wed, Jul 21, 2021 at 1:28 PM Arnaud Héritier <[email protected]>
>> >> >> > > > wrote:
>> >> >> > > >
>> >> >> > > > > I am not sure about the setup
>> >> >> > > > > AFAICS we don't use any JDK installer (
>> >> >> > > > > https://ci-maven.apache.org/configureTools/ ) thus I suppose that the
>> >> >> > > > > different JDKs are supposed to be installed directly on the agent ?
>> >> >> > > > > I am not sure how it was done on the previous environment
>> >> >> > > > >
>> >> >> > > > > On Sun, Jul 18, 2021 at 5:30 PM Tibor Digana <[email protected]>
>> >> >> > > > > wrote:
>> >> >> > > > >
>> >> >> > > > >> The new CI system has the following issue:
>> >> >> > > > >>
>> >> >> > > > >> /home/jenkins/tools/java/latest1.7/bin/java: not found
>> >> >> > > > >>
>> >> >> > > > >> https://ci-maven.apache.org/job/Maven/job/maven-box/job/maven-surefire/job/master/104/execution/node/183/log/
>> >> >> > > > >>
>> >> >> > > > >> On Wed, Jun 30, 2021 at 8:03 PM Gavin McDonald <[email protected]>
>> >> >> > > > >> wrote:
>> >> >> > > > >>
>> >> >> > > > >> > Hi Maven folks.
>> >> >> > > > >> >
>> >> >> > > > >> > Infra has decided to separate off the Maven build jobs from
>> >> >> > > > >> > ci-builds.apache.org over to its very own Jenkins Controller and Agents.
>> >> >> > > > >> >
>> >> >> > > > >> > This means that Maven now has a dedicated Jenkins environment for itself.
>> >> >> > > > >> > It also means that no other projects build jobs can build on the Maven nodes;
>> >> >> > > > >> > and then Maven jobs will no longer be able to build on the ci-builds jobs.
>> >> >> > > > >> >
>> >> >> > > > >> > Your new Controller is set up as https://ci-maven.apache.org and all Maven
>> >> >> > > > >> > Committers can login via LDAP and create jobs.
>> >> >> > > > >> >
>> >> >> > > > >> > At the time of writing, there is one node/agent attached but I am building
>> >> >> > > > >> > 4 more - all Ubuntu 20.04 and based in our Azure account.
>> >> >> > > > >> >
>> >> >> > > > >> > We can automagically move all your jobs over from ci-builds to ci-maven - I
>> >> >> > > > >> > just need someone to tell me go ahead and do it.
>> >> >> > > > >> >
>> >> >> > > > >> > In the meantime, feel free to have a test. The remaining 4 agents will be
>> >> >> > > > >> > online by tomorrow. We will review after a month if 5 is enough nodes.
>> >> >> > > > >> >
>> >> >> > > > >> > As with other projects having their own dedicated controller, who have
>> >> >> > > > >> > taken advantage of this isolation by having some nodes/agents given to the
>> >> >> > > > >> > project as a 'targeted donation' so someone here may know of a Company will
>> >> >> > > > >> > to donate 5 - 10 or more nodes specifically for Maven Jenkins environment.
>> >> >> > > > >> > Infra can afford to hand you over 5 right now.
>> >> >> > > > >> >
>> >> >> > > > >> > Let me know if you have any questions, otherwise let me know when I can
>> >> >> > > > >> > make the transfer of your jobs.
>> >> >> > > > >> >
>> >> >> > > > >> > Thanks
>> >> >> > > > >> >
>> >> >> > > > >> > --
>> >> >> > > > >> >
>> >> >> > > > >> > *Gavin McDonald*
>> >> >> > > > >> > Systems Administrator
>> >> >> > > > >> > ASF Infrastructure Team
>> >> >> > > > >> >
>> >> >> > > > >>
>> >> >> > > > >
>> >> >> > > > > --
>> >> >> > > > > Arnaud Héritier
>> >> >> > > > > Twitter/Skype : aheritier
>> >> >> > > > >
>> >> >> > > >
>> >> >> > > > --
>> >> >> > > > Arnaud Héritier
>> >> >> > > > Twitter/Skype : aheritier
>> >> >> > > >
>> >> >> > >
>> >> >> >
>> >> >> > --
>> >> >> > Arnaud Héritier
>> >> >> > Twitter/Skype : aheritier
>> >> >> >
>> >> >>
>> >> >
>> >> > --
>> >> > Arnaud Héritier
>> >> > Twitter/Skype : aheritier
>> >> >
>> >>
>> >> --
>> >>
>> >> *Gavin McDonald*
>> >> Systems Administrator
>> >> ASF Infrastructure Team
>> >>
>> >
>> > --
>> > Arnaud Héritier
>> > Twitter/Skype : aheritier
>> >
>>
>> --
>>
>> *Gavin McDonald*
>> Systems Administrator
>> ASF Infrastructure Team
>>
>
> --
> Arnaud Héritier
> Twitter/Skype : aheritier
>

--
Arnaud Héritier
Twitter/GitHub/... : aheritier
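
PS, for anyone reusing the keepalive settings at the top of this thread: values set with `sysctl -w` are lost at reboot. Below is a minimal sketch, not what was actually deployed on the ASF agents, of how the same four values could be kept in a sysctl.d-style fragment. It also shows how long an idle dead connection survives with them (idle time + interval x probes). The function names `keepalive_conf` and `dead_peer_seconds` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: prints a sysctl.d-style fragment holding the four
# values quoted in this thread. It only prints; it changes nothing.
keepalive_conf() {
  cat <<'EOF'
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_probes = 8
net.ipv4.tcp_fin_timeout = 30
EOF
}

# Hypothetical helper: seconds before the kernel drops an idle connection
# whose peer is gone: keepalive_time + keepalive_intvl * keepalive_probes.
dead_peer_seconds() {
  echo $(( $1 + $2 * $3 ))
}

keepalive_conf
dead_peer_seconds 120 30 8    # prints 360
```

To persist the values, the output of `keepalive_conf` could be written as root to a file under /etc/sysctl.d/ (the name 99-tcp-keepalive.conf would be an arbitrary choice) and loaded with `sysctl --system`. With the usual Linux defaults (7200 / 75 / 9) the same calculation gives 7875 seconds, which is why a silently dropped idle connection can go unnoticed for a long time.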
