On 5 May 2014 22:19, Charles Chan <[email protected]> wrote:

> Hello Stephen,
>
> Thank you for the informative reply. I look forward to your blog post!
>

I'm looking forward to writing it... it will not be this week...

> To answer your question, we have approximately 2 dozen standard ssh Linux
> slaves, and about 10 JNLP Windows slaves to support various
> platform/configurations.
>

Assuming your master is beefier than an m3.large, and your jobs are less chatty than my mock-load-builder, that should be perfectly reasonable.

> Based on the build history, sometimes we have up to 10 jobs running
> concurrently. Not 24x7, approximately once every 2 hours, and queue is
> pretty much empty most of the time. I would qualify the system as light
> traffic.
>

Yeah, sounds like a typical system.

> From your reply, I am even more concerned with the disproportionally high
> number of blocked threads (120) compared to offline slaves (2 at the
> time), as it sounds like it should be closer to 1:1?
>

Yes, it sounds like there is a race condition between the post-disconnect tasks and the reconnect tasks:

https://github.com/jenkinsci/ssh-slaves-plugin/blob/ssh-slaves-1.6/src/main/java/hudson/plugins/sshslaves/SSHLauncher.java#L1152
is blocking until the slave is connected... but the slave cannot connect until the disconnect tasks are complete...

> Also, do you know if the standard ssh connector performs a timeout and
> reconnect or does it block indefinitely? Not sure if each attempt to
> reconnect is spawning off new blocked threads?!
>
> Let me know if there is any other information which could prove to be
> useful.
>
> Charles
>
>
> On Monday, May 5, 2014 12:42:23 PM UTC-7, Stephen Connolly wrote:
>
>>
>> How many slaves do you have?
>>
>> It is rather easy to saturate a server with a small number of ssh-slaves
>> based slaves.
>>
>> For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently
>> building jobs as chatty as the mock-load-builder job type is the most you
>> can push.
>>
>> If you use JNLP slaves, you can get close to 60 concurrent builds before
>> the system starts falling over.
>>
>> The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has
>> a different performance characteristic... In my most recent tests I was
>> able to get up to 120 concurrent builds without affecting the Jenkins UI
>> (I had only set up for that number of slaves... It likely can go further,
>> though m3.large is not beefy enough). What was affected, though, were
>> build times. The builds were 2-3 times slower due to back-pressure effects
>> causing the builds to block on STDOUT.
>>
>> If anyone else is interested, we will be releasing our scalability test
>> harness (actually I will be ripping the bottom out of the acceptance test
>> framework and putting the scalability harness in its place... But the
>> harness is also useful for scalability testing). We will also be publishing
>> our findings.
>>
>> The other thing to watch is how your entropy pool is holding up. The
>> default random source in Linux typically gets exhausted quite quickly. That
>> can cause your ssh slaves to fail ping tests and timeout/block
>>
>> I think the package you want to install is haveged
>>
>> That or switch java to /dev/urandom
>>
>> Note: I am currently not recommending any specific slave connector, there
>> are trade-offs with each type of connector. I will be writing up a blog
>> post in the near future discussing the various trade-offs.
>>
>> Standard ssh-slaves degrades poorly... This is great if you want to know
>> when you have reached your limit
>>
>> NIO ssh-slaves degrades gracefully, I need to determine where it starts
>> degrading relative to standard ssh-slaves, but if UI responsiveness is more
>> important than build times then this has advantages (though you need to be
>> a paying cloudbees customer)
>>
>> JNLP scales the highest without affecting build times, but degrades
>> fastest, is a poor fit for on-demand connection/retention strategies and
>> does not offer the same transport encryption security as the ssh- versions
>>
>> Those are just the brief high-level measures
>>
>> On Monday, 5 May 2014, Charles Chan <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> One of the issues we have recently been experiencing with Jenkins is that
>>> the slaves (nodes) go offline for no apparent reason and do not
>>> reconnect automatically.
>>> When slaves appear as offline, we tried to launch/reconnect the slave
>>> manually but that does not work either. However, we are able to SSH into
>>> the machine using PuTTY.
>>>
>>>
>>> The only workaround is to restart the Jenkins server, until the problem
>>> surfaces again. (Typically in a week.)
>>>
>>> Instance Information
>>> --------------------
>>> Jenkins Server: 1.562
>>> SSH Credentials Plugin: 1.6.1
>>> SSH Slaves Plugin: 1.6
>>>
>>> Thread dump of slave node:
>>> {dump}
>>> "Channel reader thread: qa-linbuild-02" prio=5 WAITING
>>>     java.lang.Object.wait(Native Method)
>>>     java.lang.Object.wait(Object.java:485)
>>>     com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>>     com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>>     com.trilead.ssh2.Session.<init>(Session.java:41)
>>>     com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>>     hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>>     hudson.remoting.Channel.terminate(Channel.java:819)
>>>     hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>>>
>>> "Channel reader thread: qa-linbuild-03" prio=5 WAITING
>>>     java.lang.Object.wait(Native Method)
>>>     java.lang.Object.wait(Object.java:485)
>>>     com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>>     com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>>     com.trilead.ssh2.Session.<init>(Session.java:41)
>>>     com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>>     com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>>     hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>>     hudson.remoting.Channel.terminate(Channel.java:819)
>>>     hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>>> {dump}
>>>
>>> Also concerning is the number of threads in the BLOCKED state (126!).
>>> This doesn't seem normal, as there are no BLOCKED threads after the
>>> server is restarted.
>>>
>>> {dump}
>>> // 118 instances
>>> "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
>>>     hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
>>>     hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
>>>     jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
>>>     java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>     java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     java.lang.Thread.run(Thread.java:662)
>>>
>>> // 8 instances
>>> "Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
>>>     hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
>>>     hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
>>>     jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
>>>     java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>     java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     java.lang.Thread.run(Thread.java:662)
>>> {dump}
>>>
>>> Looking forward to any ideas or suggestions.
>>>
>>> Thank you.
>>> Charles Chan
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Jenkins Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>> --
>> Sent from my phone
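
For anyone who wants to track the BLOCKED count quoted above without eyeballing the dump, here is a minimal sketch that tallies thread states from a pasted thread dump. It assumes header lines in the same shape as the ones in this thread (a quoted thread name, then attributes, then the state as the last word on the line). The class name ThreadDumpSummary and its method are made up for illustration and are not part of Jenkins or the ssh-slaves plugin.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ThreadDumpSummary {

    // Assumed header format, e.g.:
    //   "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
    // Stack frame lines do not start with a quote, so they are ignored.
    private static final Pattern HEADER = Pattern.compile(
            "^\\s*\"(.+)\".*\\b(NEW|RUNNABLE|BLOCKED|WAITING|TIMED_WAITING|TERMINATED)\\s*$");

    /** Returns the number of thread header lines seen per thread state. */
    public static Map<String, Integer> countByState(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            Matcher m = HEADER.matcher(line);
            if (m.matches()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Small sample built from header and frame lines quoted in this thread.
        String sample = String.join("\n",
                "\"Channel reader thread: qa-linbuild-02\" prio=5 WAITING",
                "java.lang.Object.wait(Native Method)",
                "\"Computer.threadPoolForRemoting [#26]\" daemon prio=5 BLOCKED",
                "hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)",
                "\"Computer.threadPoolForRemoting [#2922]\" daemon prio=5 BLOCKED",
                "hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)");
        countByState(sample).forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Note that this counts one thread per header line, so it needs the full raw dump. The excerpt above was collapsed by hand ("// 118 instances"), and feeding that in would count each collapsed group only once.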
