Re: Jenkins slave appear offline - SSHLauncher threads BLOCKED

Stephen Connolly Tue, 06 May 2014 04:41:35 -0700

On 5 May 2014 22:19, Charles Chan <[email protected]> wrote:

> Hello Stephen,
>
> Thank you for the informative reply. I look forward to your blog post!
>
>
I'm looking forward to writing it... it will not be this week...



> To answer your question, we have approximately 2 dozen standard ssh Linux
> slaves, and about 10 JNLP Windows slaves to support various
> platform/configurations.
>

Assuming your master is beefier than an m3.large, and your jobs are less
chatty than my mock-load-builder, that should be perfectly reasonable.


>
> Based on the build history, sometimes we have up to 10 jobs running
> concurrently. Not 24x7, approximately once every 2 hours, and queue is
> pretty much empty most of the time. I would qualify the system as light
> traffic.
>

Yeah sounds like a typical system.


>
> From your reply, I am even more concerned about the disproportionately high
> number of blocked threads (120) compared to offline slaves (2 at the
> time), as it sounds like it should be closer to 1:1?
>

Yes, it sounds like there is a race condition between the post disconnect
tasks and the reconnect tasks:
https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152
is blocking until the slave is connected... but the slave cannot connect
until the disconnect tasks are complete...
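
To make the pattern concrete, here is a minimal sketch of how one stuck
cleanup can pile up BLOCKED threads behind it. This is not the plugin's code:
the Launcher class, its synchronized methods and the releaseForDemo() helper
are all hypothetical, and I am assuming (from the dump, not from reading the
source closely) that afterDisconnect and launch contend for the same monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustrative only: a hypothetical stand-in, not the real SSHLauncher.
class BlockedLauncherSketch {

    static class Launcher {
        // Stands in for "the channel is open"; never true on a dead connection.
        private final CountDownLatch channelOpen = new CountDownLatch(1);

        // Assumption: cleanup runs while holding the launcher's monitor.
        synchronized void afterDisconnect(CountDownLatch entered) throws InterruptedException {
            entered.countDown();   // signal that the monitor is now held
            channelOpen.await();   // waits without releasing the monitor
        }

        // Assumption: relaunch needs the same monitor, so it queues behind cleanup.
        synchronized void launch() {
            // a real launcher would reconnect here
        }

        // Demo-only escape hatch so the sketch terminates.
        void releaseForDemo() {
            channelOpen.countDown();
        }
    }

    // Returns {state of the cleanup thread, state of the relaunch thread}.
    static Thread.State[] observe() throws InterruptedException {
        Launcher launcher = new Launcher();
        CountDownLatch entered = new CountDownLatch(1);

        Thread cleanup = new Thread(() -> {
            try {
                launcher.afterDisconnect(entered);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "cleanup");
        Thread relaunch = new Thread(launcher::launch, "relaunch");

        cleanup.start();
        entered.await();           // cleanup now owns the monitor
        relaunch.start();

        // Poll (for at most 5 seconds) until both threads have settled.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (System.nanoTime() < deadline
                && !(cleanup.getState() == Thread.State.WAITING
                     && relaunch.getState() == Thread.State.BLOCKED)) {
            Thread.sleep(10);
        }
        Thread.State[] states = { cleanup.getState(), relaunch.getState() };

        launcher.releaseForDemo(); // let both threads finish
        cleanup.join();
        relaunch.join();
        return states;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread.State[] states = observe();
        System.out.println("cleanup=" + states[0] + " relaunch=" + states[1]);
    }
}
```

In the sketch the cleanup thread sits in WAITING while the relaunch thread is
reported as BLOCKED. If every reconnect attempt queues another task behind
the same monitor, the BLOCKED count would grow well beyond the number of
offline slaves, which would fit what you are seeing.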


> Also, do you know if the standard ssh connector performs a timeout and
> reconnect or does it block indefinitely? Not sure if each attempt to
> reconnect is spawning off new blocked threads?!
>
> Let me know if there is any other information which could prove to be
> useful.
>
> Charles
>
>
> On Monday, May 5, 2014 12:42:23 PM UTC-7, Stephen Connolly wrote:
>
>>
>> How many slaves do you have?
>>
>> It is rather easy to saturate a server with a small number of ssh-slaves
>> based slaves.
>>
>> For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently
>> building jobs as chatty as the mock-load-builder job type is the most you
>> can push.
>>
>> If you use JNLP slaves, you can get close to 60 concurrent builds before
>> the system starts falling over.
>>
>> The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has
>> a different performance characteristic... In my most recent tests I was able
>> to get up to 120 concurrent builds without affecting the Jenkins UI (I
>> had only set up that number of slaves... It can likely go further,
>> though m3.large is not beefy enough). What was affected, though, were build
>> times. The builds were 2-3 times slower due to back-pressure effects
>> causing the builds to block on STDOUT.
>>
>> If anyone else is interested, we will be releasing our scalability test
>> harness (actually I will be ripping the bottom out of the acceptance test
>> framework and putting the scalability harness in its place... But the
>> harness is also useful for scalability testing). We will also be publishing
>> our findings.
>>
>> The other thing to watch is how your entropy pool is holding up. The
>> default random source in Linux typically gets exhausted quite quickly. That
>> can cause your ssh slaves to fail ping tests and time out/block.
>>
>> I think the package you want to install is haveged.
>>
>> That, or switch Java to /dev/urandom (e.g. via the java.security.egd
>> system property).
>>
>> Note: I am currently not recommending any specific slave connector, there
>> are trade-offs with each type of connector. I will be writing up a blog
>> post in the near future discussing the various trade-offs.
>>
>> Standard ssh-slaves degrades poorly... This is great if you want to know
>> when you have reached your limit.
>>
>> NIO ssh-slaves degrades gracefully. I still need to determine where it starts
>> degrading relative to standard ssh-slaves, but if UI responsiveness is more
>> important than build times then this has advantages (though you need to be
>> a paying CloudBees customer).
>>
>> JNLP scales the highest without affecting build times, but degrades
>> fastest, is a poor fit for on-demand connection/retention strategies and
>> does not offer the same transport encryption security as the ssh- versions.
>>
>> Those are just the brief high-level measures
>>
>> On Monday, 5 May 2014, Charles Chan <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> One of the issues we have recently been experiencing with Jenkins is that
>>> the slaves (nodes) go offline for no apparent reason and do not
>>> reconnect automatically.
>>> When slaves appear as offline, we tried to launch/reconnect the slave
>>> manually but it does not work either. However, we are able to SSH into the
>>> machine using PuTTY.
>>>
>>>
>>> The only workaround is to restart the Jenkins server, until the problem 
>>> surfaces again. (Typically in a week.)
>>>
>>> Instance Information
>>> --------------------
>>> Jenkins Server:            1.562
>>> SSH Credentials Plugin:    1.6.1
>>> SSH Slaves Plugin:         1.6
>>>
>>> Thread dump of slave node:
>>> {dump}
>>> "Channel reader thread: qa-linbuild-02" prio=5 WAITING
>>>     java.lang.Object.wait(Native Method)
>>>     java.lang.Object.wait(Object.java:485)
>>>     com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>>     com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>>     com.trilead.ssh2.Session.<init>(Session.java:41)
>>>     com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>>     hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>>     hudson.remoting.Channel.terminate(Channel.java:819)
>>>     hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>>>
>>> "Channel reader thread: qa-linbuild-03" prio=5 WAITING
>>>     java.lang.Object.wait(Native Method)
>>>     java.lang.Object.wait(Object.java:485)
>>>     com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>>     com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>>     com.trilead.ssh2.Session.<init>(Session.java:41)
>>>     com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>>     hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>>     hudson.remoting.Channel.terminate(Channel.java:819)
>>>     hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>>> {dump}
>>>
>>> Also concerning is the number of threads in the BLOCKED state (126!).
>>> This doesn't seem normal, as there are no BLOCKED threads after the server
>>> is restarted.
>>>
>>>
>>> {dump}
>>> // 118 instances
>>> "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
>>>     hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
>>>     jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
>>>     java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>     java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     java.lang.Thread.run(Thread.java:662)
>>>
>>> // 8 instances
>>> "Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
>>>     hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
>>>     hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
>>>     jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
>>>     java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     java.lang.Thread.run(Thread.java:662)
>>>
>>> {dump}
>>>
>>> Looking forward to any ideas or suggestions.
>>>
>>> Thank you.
>>> Charles Chan
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Jenkins Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>> Sent from my phone
>>

