I have found what appears to be a deadlock that occurs when the process
that reaps old nodes is running and is in the middle of removing a node.
Although this trace is from the docker-plugin, it uses AbstractCloudSlave
underneath, and its retention strategy implementation is practically the
same as hudson.slaves.CloudRetentionStrategy.
What happens:
* Thread A (Timer) ComputerRetentionWork:66 is looking for computers to reap.
* It calls DockerRetentionStrategy.check - which is a *synchronized*
method - acquiring lock *[1]*.
* That filters through and ultimately calls
hudson.slaves.AbstractCloudSlave.terminate.
* AbstractCloudSlave tries to call the synchronized method
Jenkins.getInstance().removeNode(this), but blocks waiting for lock *[2]*.
Meanwhile
* Thread B is in the business of provisioning a new node in DockerCloud
(the implementation is predominantly the same as EC2Cloud and various others).
* It calls Jenkins.addNode(...), which acquires lock *[2]*
- that calls updateComputerList, which ultimately calls
DockerRetentionStrategy.check, which decides a node needs to be removed.
* That calls AbstractCloudSlave.terminate (as above). It already holds
lock *[2]*, so it continues.
* It calls Jenkins.removeNode, which calls updateComputerList. This tries
to re-check the retention of the computer above via RetentionStrategy.check.
* But to do this, it needs lock *[1]*.
The two threads are therefore deadlocked.
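The sequence above reduces to a classic lock-ordering inversion: Thread A takes *[1]* then wants *[2]*, Thread B takes *[2]* then wants *[1]*. A standalone sketch (the class and lock names are illustrative stand-ins, not the actual Jenkins/plugin code) reproduces the shape, using ReentrantLock.tryLock with a timeout so we can observe the failure instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-ins: "strategy" plays lock [1] (the synchronized
// RetentionStrategy.check) and "jenkins" plays lock [2] (the Jenkins monitor).
public class LockOrderDemo {
    static final ReentrantLock strategy = new ReentrantLock(); // lock [1]
    static final ReentrantLock jenkins  = new ReentrantLock(); // lock [2]
    static final boolean[] got = new boolean[2]; // did each thread get its 2nd lock?

    // Take `first`, wait until the other thread holds its own first lock,
    // then try to take `second`. The `done` latch keeps `first` held until
    // both threads have finished trying, so the outcome is deterministic.
    static boolean acquireBoth(ReentrantLock first, ReentrantLock second,
                               CountDownLatch ready, CountDownLatch done)
            throws InterruptedException {
        first.lock();
        try {
            ready.countDown();
            ready.await();                 // both threads now hold their first lock
            boolean acquired = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) second.unlock();
            done.countDown();
            done.await();                  // hold first lock until both have tried
            return acquired;
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch ready = new CountDownLatch(2);
        CountDownLatch done  = new CountDownLatch(2);
        // Thread A: retention timer -> check() takes [1] -> removeNode() wants [2]
        Thread a = new Thread(() -> {
            try { got[0] = acquireBoth(strategy, jenkins, ready, done); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        // Thread B: provisioning -> addNode() takes [2] -> check() wants [1]
        Thread b = new Thread(() -> {
            try { got[1] = acquireBoth(jenkins, strategy, ready, done); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        a.start(); b.start(); a.join(); b.join();
        // With real synchronized monitors both threads would block forever;
        // here, neither thread can acquire its second lock.
        System.out.println("A acquired both: " + got[0]);
        System.out.println("B acquired both: " + got[1]);
    }
}
```

With plain synchronized blocks, as in the real code, there is no timeout to fall back on, which is exactly the state in the thread dump below.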
I'm not sure why DockerRetentionStrategy needs to be synchronized on check.
I think the fix may be for the retention strategy to simply fall through
and return '1' (i.e. check again in a minute) if a check is already in
progress, which is what I'm going to try.
I thought it worth discussing since CloudRetentionStrategy does the same
thing, and it might be a general issue.
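The fall-through idea could look roughly like the sketch below (hypothetical names, not the actual plugin code): replace the synchronized keyword with a non-blocking try-acquire so a concurrent or re-entrant check returns 1 instead of waiting on the monitor:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed fix: guard check() with a compare-and-set flag
// instead of synchronized, so a re-entrant or concurrent call falls
// through rather than blocking (and thus cannot participate in the
// lock-ordering deadlock described above).
public class GuardedRetentionCheck {
    private final AtomicBoolean checking = new AtomicBoolean(false);

    // Returns the number of minutes until the next check, mirroring the
    // RetentionStrategy.check contract.
    public long check() {
        if (!checking.compareAndSet(false, true)) {
            return 1; // a check is already in progress; try again in a minute
        }
        try {
            return doCheck();
        } finally {
            checking.set(false);
        }
    }

    // Placeholder for the real reaping logic (terminate idle nodes etc.).
    protected long doCheck() {
        return 1;
    }
}
```

Note that this trades mutual exclusion for liveness: a nested call triggered via updateComputerList simply skips the check, on the assumption that the next timer tick will pick it up.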
---
"Computer.threadPoolForRemoting [#208]":
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:24)
- waiting to lock <0x000000009a5f4a50> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:678)
at
hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at
hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000000805a7d68> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1218)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1716)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1711)
- locked <0x00000000805a7c50> (a hudson.model.Hudson)
at
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at
com.nirima.jenkins.plugins.docker.DockerSlave.retentionTerminate(DockerSlave.java:161)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:32)
- locked <0x000000009a5f52d8> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:678)
at
hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at
hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000000805a7d68> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1218)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1716)
at jenkins.model.Jenkins.addNode(Jenkins.java:1698)
- locked <0x00000000805a7c50> (a hudson.model.Hudson)
at
com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:131)
at
com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:125)
at
jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
"jenkins.util.Timer [#3]":
at jenkins.model.Jenkins.removeNode(Jenkins.java:1705)
- waiting to lock <0x00000000805a7c50> (a hudson.model.Hudson)
at
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at
com.nirima.jenkins.plugins.docker.DockerSlave.retentionTerminate(DockerSlave.java:161)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:32)
- locked <0x000000009a5f4a50> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at
hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Found 1 deadlock.
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.