I see now that the value for "kexTimeout" should be the 210 value you
reference. BTW, we are using only 5 agents.
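
Just to make the units concrete, here is a rough sketch of how I now read the
value flow. This is not the plugin's actual code and the names are only my own
labels; the point is just that the field is entered in seconds, defaults to
210, and ends up as the kexTimeout in milliseconds:

    // Sketch only -- not the plugin's real code; names below are hypothetical.
    public class TimeoutSketch {
        static final int DEFAULT_LAUNCH_TIMEOUT_SECONDS = 210;  // the default you mentioned

        static long kexTimeoutMillis(Integer configuredSeconds) {
            int seconds = (configuredSeconds == null || configuredSeconds <= 0)
                    ? DEFAULT_LAUNCH_TIMEOUT_SECONDS
                    : configuredSeconds;
            return seconds * 1000L;  // the kexTimeout handed to the SSH connection is in milliseconds
        }

        public static void main(String[] args) {
            System.out.println(kexTimeoutMillis(null));  // 210000 -> the 210 second default
            System.out.println(kexTimeoutMillis(15));    // 15000  -> what I had assumed earlier
        }
    }
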
I found reports of a similar problem with connection timeouts here:
https://github.com/jenkinsci/ec2-fleet-plugin/issues/41
and, as an experiment, followed the recommendation of increasing the
connection timeout to an absurdly high value, viz., 6000. The connection
timeouts stopped, but we're still seeing communications problems, and they
appear to originate on the slaves. The remoting log contains:
Feb 11, 2019 8:15:03 AM hudson.remoting.ProxyOutputStream$Chunk$1 run
WARNING: Failed to ack the stream
java.io.IOException: Broken pipe
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:313)
    at hudson.remoting.StandardOutputStream.write(StandardOutputStream.java:83)
    at hudson.remoting.ChunkedOutputStream.sendFrame(ChunkedOutputStream.java:89)
    at hudson.remoting.ChunkedOutputStream.sendBreak(ChunkedOutputStream.java:62)
    at hudson.remoting.ChunkedCommandTransport.writeBlock(ChunkedCommandTransport.java:46)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.write(AbstractSynchronousByteArrayCommandTransport.java:47)
    at hudson.remoting.Channel.send(Channel.java:721)
    at hudson.remoting.ProxyOutputStream$Chunk$1.run(ProxyOutputStream.java:270)
    at hudson.remoting.PipeWriter$1.run(PipeWriter.java:158)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:131)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Feb 11, 2019 8:15:03 AM hudson.remoting.SynchronousCommandTransport$ReaderThread run
SEVERE: I/O error in channel channel
java.io.IOException: Unexpected termination of the channel
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:77)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2671)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3146)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:858)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:354)
    at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
    at hudson.remoting.Command.readFrom(Command.java:140)
    at hudson.remoting.Command.readFrom(Command.java:126)
    at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:36)
    at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:63)
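
For what it's worth, my read of the "Broken pipe" is that the agent-side
process is writing to a stream whose reading end (the master's side of the
channel) has already gone away, so the write fails. A tiny, generic
illustration of that failure mode -- nothing Jenkins-specific, just an
assumption about what the trace means:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Writes to a peer that has already closed its end; once the OS notices,
    // the write throws IOException ("Broken pipe" / "Connection reset").
    public class BrokenPipeDemo {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(0);
                 Socket client = new Socket("localhost", server.getLocalPort())) {
                server.accept().close();           // the "reader" disappears immediately
                OutputStream out = client.getOutputStream();
                for (int i = 0; i < 100; i++) {    // keep writing until the failure surfaces
                    out.write(new byte[8192]);
                    out.flush();
                }
            } catch (IOException e) {
                e.printStackTrace();               // the same kind of error as in the log above
            }
        }
    }
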
On Wednesday, February 6, 2019 at 1:43:24 PM UTC-5, Ivan Fernandez Calvo
wrote:
>
> This timeout is only for the connection stage, and it includes all the
> retry reconnections (long history). The default value is 210 seconds;
> anything less than 30-60 seconds is not a good value, and only if you have
> retries set to 0.
>
> I do not know how many agents you spin up at the same time; I would try to
> find where the limit is. I mean, if I spin up 50 agents and it is stable
> for some hours, I would increase the number until I find the limit.
>
> Regards
> Ivan Fernandez Calvo
>
> On Feb 6, 2019, at 19:01, Glenn Burkhardt <[email protected]> wrote:
>
> My reading of the code indicates that the timeout value is set by
> "kexTimeout" in com\trilead\ssh2\Connection.java at line 693. That appears
> to be set in SSHLauncher.openConnection():1184. The value we're using for
> 'launchTimeoutMillis' should be 15000, assuming that it comes from "Startup
> Idle" in the slave configuration (we enter '15', since the help says that
> the units are seconds). The machine we're using has 64gb of RAM, and 24
> cores. It's possible that it's a performance issue, but I doubt it. I'll
> try monitoring that a bit...
>
> Thanks for your response.
>
> On Wednesday, February 6, 2019 at 11:14:53 AM UTC-5, Ivan Fernandez Calvo
> wrote:
>>
>> >Jenkins and the VMs are all running on the same machine, so network
>> activity shouldn't be an issue.
>>
>> The network is not an issue, but response time under load could be. Could
>> it be? It is not a good idea to run the Jenkins Master and Agents on the
>> same machine; if you use Docker containers and you do not limit the
>> resources used by each container (memory and CPU), they will fight for
>> resources and some of them will be killed by Docker with a nice OOM.
>>