Unfortunately it's not possible to reconnect to an SSH session; if the session 
is disconnected, the SSH daemon on the receiving end will close its end, and 
kill any processes that had been launched by that connection. In other words, 
any job that was running will be lost.

----- Original Message -----
From: [email protected]
To: [email protected]
At: May  5 2014 17:19:46

Hello Stephen,

Thank you for the informative reply. I look forward to your blog post!

To answer your question, we have approximately 2 dozen standard ssh Linux 
slaves, and about 10 JNLP Windows slaves to support various 
platform/configurations.

Based on the build history, we sometimes have up to 10 jobs running 
concurrently. That is not 24x7 but approximately once every 2 hours, and the 
queue is pretty much empty most of the time. I would qualify the system as 
light traffic.

From your reply, I am even more concerned about the disproportionately high 
number of blocked threads (120) compared to offline slaves (2 at the time), as 
it sounds like the ratio should be closer to 1:1. Also, do you know if the 
standard ssh connector performs a timeout and reconnect, or does it block 
indefinitely? I am not sure if each attempt to reconnect is spawning off new 
blocked threads.

Let me know if there is any other information which could prove to be useful.

Charles

On Monday, May 5, 2014 12:42:23 PM UTC-7, Stephen Connolly wrote:

How many slaves do you have?

It is rather easy to saturate a server with a small number of ssh-slaves based 
slaves.

For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently 
building jobs as chatty as the mock-load-builder job type is the most you can 
push.

If you use JNLP slaves, you can get close to 60 concurrent builds before the 
system starts falling over.

The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has a 
different performance characteristic... In my most recent tests I was able to 
get up to 120 concurrent builds without affecting the Jenkins UI (I had only 
set up that number of slaves... It can likely go further, though m3.large is 
not beefy enough). What was affected, though, were build times. The builds were 
2-3 times slower due to back-pressure effects causing the builds to block on 
STDOUT.

If anyone else is interested, we will be releasing our scalability test harness 
(actually I will be ripping the bottom out of the acceptance test framework and 
putting the scalability harness in its place... But the harness is also useful 
for scalability testing). We will also be publishing our findings.

The other thing to watch is how your entropy pool is holding up. The default 
random source in Linux typically gets exhausted quite quickly. That can cause 
your ssh slaves to fail ping tests and time out or block.

I think the package you want to install is haveged

That or switch java to /dev/urandom
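
To see whether the entropy pool is actually running low, the kernel's estimate can be read from procfs. The following is only a minimal sketch: the path is the standard Linux location, but the threshold of 200 bits and the function names are assumptions made for illustration, not values from this thread.

```python
from pathlib import Path

# Standard Linux procfs location for the kernel's entropy estimate.
ENTROPY_PATH = Path("/proc/sys/kernel/random/entropy_avail")

# Hypothetical threshold chosen for illustration only; tune it for your hosts.
LOW_ENTROPY_BITS = 200


def parse_entropy(text):
    """Parse the contents of entropy_avail into an integer number of bits."""
    return int(text.strip())


def read_entropy(path=ENTROPY_PATH):
    """Return the kernel's available-entropy estimate in bits (Linux only)."""
    return parse_entropy(Path(path).read_text())


def is_entropy_low(bits, threshold=LOW_ENTROPY_BITS):
    """True when the pool estimate is below the chosen threshold."""
    return bits < threshold


if __name__ == "__main__":
    bits = read_entropy()
    status = "low" if is_entropy_low(bits) else "ok"
    print(f"entropy_avail={bits} bits ({status})")
```

Running this periodically on the Jenkins server and on the slaves gives a rough picture of whether the pool drains while builds are connecting. For the /dev/urandom route, the JVM system property commonly used is java.security.egd (for example -Djava.security.egd=file:/dev/./urandom); check the exact form against the documentation for your Java version.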

Note: I am currently not recommending any specific slave connector, there are 
trade-offs with each type of connector. I will be writing up a blog post in the 
near future discussing the various trade-offs.

Standard ssh-slaves degrades poorly... This is great if you want to know when 
you have reached your limit

NIO ssh-slaves degrades gracefully, I need to determine where it starts 
degrading relative to standard ssh-slaves, but if UI responsiveness is more 
important than build times then this has advantages (though you need to be a 
paying cloudbees customer)

JNLP scales the highest without affecting build times, but degrades fastest, is 
a poor fit for on-demand connection/retention strategies and does not offer the 
same transport encryption security as the ssh- versions

Those are just the brief high-level measures

On Monday, 5 May 2014, Charles Chan <[email protected]> wrote:

Hello,

One of the issues we have recently been experiencing with Jenkins is that the 
slaves (nodes) go offline for no apparent reason and do not reconnect 
automatically.
When slaves appear as offline, we try to launch/reconnect the slave manually, 
but that does not work either. However, we are able to SSH into the machine 
using PuTTY.
The only workaround is to restart the Jenkins server, until the problem 
surfaces again (typically after about a week).

Instance Information
--------------------
Jenkins Server:            1.562
SSH Credentials Plugin:    1.6.1
SSH Slaves Plugin:         1.6

Thread dump of slave node:
{dump}
"Channel reader thread: qa-linbuild-02" prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
    com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
    com.trilead.ssh2.Session.<init>(Session.java:41)
    com.trilead.ssh2.Connection.openSession(Connection.java:1129)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
    hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
    hudson.remoting.Channel.terminate(Channel.java:819)
    hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)

"Channel reader thread: qa-linbuild-03" prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
    com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
    com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
    com.trilead.ssh2.Session.<init>(Session.java:41)
    com.trilead.ssh2.Connection.openSession(Connection.java:1129)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
    com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
    hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
    hudson.remoting.Channel.terminate(Channel.java:819)
    hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
{dump}

Also concerning is the number of threads in the BLOCKED state (126!). 
This doesn't seem normal, as there are no BLOCKED threads after the server is 
restarted.
{dump}
// 118 instances
"Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
    hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
    jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
    java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)

// 8 instances
"Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
    hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
    hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
    jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
    java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    java.util.concurrent.FutureTask.run(FutureTask.java:138)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    java.lang.Thread.run(Thread.java:662)
{dump}
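
For anyone who wants to reproduce the "N instances" grouping from a full dump, here is a rough sketch that counts threads by state and top stack frame. It assumes the dump has been saved as text in the same layout as the excerpts above (a quoted thread name, an optional "daemon", "prio=N", then the state, followed by the frames). The function name, the sample text, and the node name "example-node" are made up for illustration.

```python
import re
from collections import Counter

# Matches header lines such as:
#   "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
HEADER = re.compile(
    r'^"(?P<name>[^"]+)"\s+(?:daemon\s+)?prio=\d+\s+(?P<state>[A-Z_]+)'
)


def summarize(dump_text):
    """Count threads by (state, top stack frame) in a text thread dump."""
    counts = Counter()
    state = None  # state of the thread whose top frame is still pending
    for raw in dump_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blank lines and "// N instances" annotations
        match = HEADER.match(line)
        if match:
            if state is not None:
                counts[(state, "<no frames>")] += 1  # previous thread had none
            state = match.group("state")
        elif state is not None:
            counts[(state, line)] += 1  # first frame after the header
            state = None
    if state is not None:
        counts[(state, "<no frames>")] += 1
    return counts


# Made-up sample in the same layout as the excerpts above.
SAMPLE = '''
"Channel reader thread: example-node" prio=5 WAITING
    java.lang.Object.wait(Native Method)
    java.lang.Object.wait(Object.java:485)
"Computer.threadPoolForRemoting [#1]" daemon prio=5 BLOCKED
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
    hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
"Computer.threadPoolForRemoting [#2]" daemon prio=5 BLOCKED
    hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
'''

if __name__ == "__main__":
    for (state, frame), n in summarize(SAMPLE).most_common():
        print(f"{n:4d}  {state:8s}  {frame}")
```

Grouping by top frame makes it easy to see whether the BLOCKED threads are all waiting at the same line, as in the two groups above, and to compare the counts between dumps taken a few days apart.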

Looking forward to any ideas or suggestions.

Thank you.
Charles Chan
-- 
You received this message because you are subscribed to the Google Groups 
"Jenkins Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
 For more options, visit https://groups.google.com/d/optout.


-- 
Sent from my phone
