I think I should adjust one comment in my description: "It seems like Jenkins is hung in the build start phase." => "It seems like Jenkins is hung in the slave start phase."
Anyway, there is nothing interesting in the logs on the slave side, since the Jenkins slave JAR never gets started. I don't see any Jenkins-related process on the slave machine.

I did notice a similar situation that I was able to dig into a little more. It involved a physically hung machine that needed a reboot while a Jenkins slave instance was running. I sent an email to the developer list to see if anyone could lend credibility to my hypothesis, but unfortunately got no reply. I reproduce the text of that email below, with the information I gathered. Perhaps it's related, since the symptoms were essentially the same.
_____________________________________________________________________________________________

Hi,

I am trying to debug the following symptom:
Jenkins started a slave. The slave died (the machine hung and never had a chance to communicate back to the master). Jenkins tries to restart it, but is not able to. When I try to restart the slave manually, nothing happens. The slave log is empty and stays empty, while the spinning icon just keeps running.

I had a look at the thread dump and saw a number of threads blocked and waiting for the following thread:

"Computer.threadPoolForRemoting 14515" Id=577044 Group=main WAITING on com.trilead.ssh2.channel.Channel@76dd2191
at java.lang.Object.wait(Native Method)

  • waiting on com.trilead.ssh2.channel.Channel@76dd2191
    at java.lang.Object.wait(Object.java:503)
    at com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
    at com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
    at com.trilead.ssh2.Session.<init>(Session.java:41)
    at com.trilead.ssh2.Connection.openSession(Connection.java:1129)
  • locked com.trilead.ssh2.Connection@1e553bf
    at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
    at com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
    at hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
  • locked hudson.plugins.sshslaves.SSHLauncher@3c884383
    at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:547)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Number of locked synchronizers = 1

  • java.util.concurrent.ThreadPoolExecutor$Worker@638f4d22

Looking at afterDisconnect in hudson.plugins.sshslaves.SSHLauncher, I see no hint of code that deals with timeouts. Looking further up the stack, I wonder what happens when openSessionChannel tries to make a connection to the slave but the slave has died on the other side. The code does not look like it times out. If that is the case, and whatever is supposed to respond on the other side of the channel is also dead, it seems to me that waitUntilChannelOpen will never return and will hang forever. The hudson.plugins.sshslaves.SSHLauncher lock would then never be released, and other threads wanting this lock would block forever, i.e. an effective deadlock.
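
To illustrate the kind of guard I would expect to find, here is a minimal sketch of bounding a blocking cleanup step with a timeout. It is not code from Jenkins, the ssh-slaves plugin, or trilead: BoundedCleanup and runWithTimeout are hypothetical names, and the hanging task is only a stand-in for a wait that is never signalled. Whether the real channel-open wait would respond to interruption is an assumption I have not verified.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only; not part of Jenkins or any plugin.
final class BoundedCleanup {

    /**
     * Runs the task on a separate worker thread and waits at most timeoutMillis.
     * Returns true if the task finished in time, false if it was abandoned.
     */
    static boolean runWithTimeout(Runnable task, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-cleanup");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> result = worker.submit(task);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the task honours interruption.
            result.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("cleanup task failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Object neverNotified = new Object();
        // Stand-in for a wait on a channel whose remote side is dead.
        Runnable hangs = () -> {
            synchronized (neverNotified) {
                try {
                    while (true) {
                        neverNotified.wait();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        System.out.println("hanging task finished: " + runWithTimeout(hangs, 200));
        System.out.println("quick task finished: " + runWithTimeout(() -> { }, 200));
    }
}
```

The point of the sketch is that the caller gets control back after the timeout instead of waiting forever, so a lock held by the caller can be released even when the remote side never answers.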

Can anyone confirm or refute my logic here? It certainly seems like this could explain my symptoms.

Kind regards.

Artur
