[JIRA] [core] (JENKINS-24155) Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

[email protected] (JIRA) Tue, 19 Aug 2014 15:48:13 -0700

After a bit of experimenting I can reproduce this scenario fairly easily using a Ubuntu 12.04 master and a Windows 7 slave (both using java 7). Once the slave is connected I forcibly suspend the windows 7 computer and monitor the TCP connection between the master and slave (on the master) using netstat.

netstat -na | grep 42715
tcp6       0      0 :::42715                :::*                    LISTEN
tcp6       0      0 192.168.1.23:42715      192.168.1.24:58905      ESTABLISHED
tcp6       0   2479 192.168.1.23:42715      192.168.1.115:61283     ESTABLISHED

When the master attempts to "ping" the slave the slave does not respond and the TCP send queue builds up (2479 in the example above).

Once the queue has been building for a few minutes bring the Windows 7 machine back to life and let things recover naturally.

I observe that the Windows 7 machine issues a TCP RST on the connection. But the Linux master does not seem to react to the RST and continues to add data into the send queue.

During this time the slave has attempted to restart the connection and failed because the master thinks that the slave is still connected. The windows slave service stops attempting a restart after a couple of failures.

After a few minutes the channel pinger on the master declares that the slave is dead

Aug 19, 2014 9:34:24 PM hudson.slaves.ChannelPinger$1 onDead
INFO: Ping failed. Terminating the channel.
java.util.concurrent.TimeoutException: Ping started on 1408480224640 hasn't completed at 1408480464640
        at hudson.remoting.PingThread.ping(PingThread.java:120)
        at hudson.remoting.PingThread.run(PingThread.java:81)

But even at this time the TCP stream stays open the slave connection is continuing to operate.

After a further 10 minutes the connection does close. It seems like this is a standard TCP timeout.

WARNING: Communication problem
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
        at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
        at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

Aug 19, 2014 9:44:13 PM jenkins.slaves.JnlpSlaveAgentProtocol$Handler$1 onClosed
WARNING: NioChannelHub keys=2 gen=2823: Computer.threadPoolForRemoting [#2] for + Cygnet terminated
java.io.IOException: Failed to abort
        at org.jenkinsci.remoting.nio.NioChannelHub$NioTransport.abort(NioChannelHub.java:195)
        at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:581)
        at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.jenkinsci.remoting.nio.FifoBuffer$Pointer.receive(FifoBuffer.java:136)
        at org.jenkinsci.remoting.nio.FifoBuffer.receive(FifoBuffer.java:306)
        at org.jenkinsci.remoting.nio.NioChannelHub.run(NioChannelHub.java:514)
        ... 6 more

There does not seem to be a "fail fast" method in operation. It isn't clear whether this is due to the Linux networking stack or whether Java could have failed a lot quicker when it determined that the connection ping had timed out.

Sadly it is not immediately obvious that the disconnect could be done instantly because it all seems to be tied up with standard TCP retries.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

--
You received this message because you are subscribed to the Google Groups "Jenkins Issues" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.

[JIRA] [core] (JENKINS-24155) Jenkins Slaves Go Offline In Large Quantities and Don't Reconnect Until Reboot

Reply via email to