I have found what appears to be a deadlock that occurs when the process
that reaps old nodes is running and is in the middle of removing a node.
Although this trace is from the docker-plugin, it uses AbstractCloudSlave
underneath, and its retention strategy implementation is practically the
same as hudson.slaves.CloudRetentionStrategy.
What happens:
* Thread A (Timer) ComputerRetentionWork:66 is looking for computers to reap.
* It calls DockerRetentionStrategy.check - which is a *synchronized*
method - acquiring lock *[1]*.
* That filters through and ultimately calls
hudson.slaves.AbstractCloudSlave.terminate.
* AbstractCloudSlave tries to call the synchronized method
Jenkins.getInstance().removeNode(this), but blocks waiting for lock *[2]*.
Meanwhile
* Thread B is in the business of provisioning a new node in DockerCloud
(the implementation is predominantly the same as EC2Cloud and various others).
* It calls Jenkins.addNode(...), which acquires lock *[2]*
- that calls updateComputerList, which ultimately calls
DockerRetentionStrategy.check, which decides a node needs to be removed.
* That calls AbstractCloudSlave.terminate (as above). It already holds
lock *[2]*, so it continues.
* It calls Jenkins.removeNode, which calls updateComputerList. This tries
to re-check the retention of the computer above via RetentionStrategy.check.
* But to do this, it needs lock *[1]*.
The two threads are therefore deadlocked.
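The sequence above reduces to a classic lock-ordering inversion: Thread A takes *[1]* then wants *[2]*, Thread B takes *[2]* then wants *[1]*. A standalone sketch (the class and lock names are illustrative stand-ins, not the actual Jenkins/plugin code) reproduces the shape, using ReentrantLock.tryLock with a timeout so we can observe the failure instead of hanging:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative stand-ins: "strategy" plays lock [1] (the synchronized
// RetentionStrategy.check) and "jenkins" plays lock [2] (the Jenkins monitor).
public class LockOrderDemo {
    static final ReentrantLock strategy = new ReentrantLock(); // lock [1]
    static final ReentrantLock jenkins  = new ReentrantLock(); // lock [2]
    static final boolean[] got = new boolean[2]; // did each thread get its 2nd lock?

    // Take `first`, wait until the other thread holds its own first lock,
    // then try to take `second`. The `done` latch keeps `first` held until
    // both threads have finished trying, so the outcome is deterministic.
    static boolean acquireBoth(ReentrantLock first, ReentrantLock second,
                               CountDownLatch ready, CountDownLatch done)
            throws InterruptedException {
        first.lock();
        try {
            ready.countDown();
            ready.await();                 // both threads now hold their first lock
            boolean acquired = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (acquired) second.unlock();
            done.countDown();
            done.await();                  // hold first lock until both have tried
            return acquired;
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        CountDownLatch ready = new CountDownLatch(2);
        CountDownLatch done  = new CountDownLatch(2);
        // Thread A: retention timer -> check() takes [1] -> removeNode() wants [2]
        Thread a = new Thread(() -> {
            try { got[0] = acquireBoth(strategy, jenkins, ready, done); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        // Thread B: provisioning -> addNode() takes [2] -> check() wants [1]
        Thread b = new Thread(() -> {
            try { got[1] = acquireBoth(jenkins, strategy, ready, done); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        a.start(); b.start(); a.join(); b.join();
        // With real synchronized monitors both threads would block forever;
        // here, neither thread can acquire its second lock.
        System.out.println("A acquired both: " + got[0]);
        System.out.println("B acquired both: " + got[1]);
    }
}
```

With plain synchronized blocks, as in the real code, there is no timeout to fall back on, which is exactly the state in the thread dump below.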
I'm not sure why DockerRetentionStrategy needs to be synchronized on check.
I think the fix may be for the retention strategy to simply fall through
and return '1' (i.e. check again in a minute) if a check is already in
progress, which is what I'm going to try.
I thought it worth discussing since CloudRetentionStrategy does the same
thing, and it might be a general issue.
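The fall-through idea could look roughly like the sketch below (hypothetical names, not the actual plugin code): replace the synchronized keyword with a non-blocking try-acquire so a concurrent or re-entrant check returns 1 instead of waiting on the monitor:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed fix: guard check() with a compare-and-set flag
// instead of synchronized, so a re-entrant or concurrent call falls
// through rather than blocking (and thus cannot participate in the
// lock-ordering deadlock described above).
public class GuardedRetentionCheck {
    private final AtomicBoolean checking = new AtomicBoolean(false);

    // Returns the number of minutes until the next check, mirroring the
    // RetentionStrategy.check contract.
    public long check() {
        if (!checking.compareAndSet(false, true)) {
            return 1; // a check is already in progress; try again in a minute
        }
        try {
            return doCheck();
        } finally {
            checking.set(false);
        }
    }

    // Placeholder for the real reaping logic (terminate idle nodes etc.).
    protected long doCheck() {
        return 1;
    }
}
```

Note that this trades mutual exclusion for liveness: a nested call triggered via updateComputerList simply skips the check, on the assumption that the next timer tick will pick it up.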
---
"Computer.threadPoolForRemoting [#208]":
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:24)
- waiting to lock <0x000000009a5f4a50> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:678)
at
hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at
hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000000805a7d68> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1218)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1716)
at jenkins.model.Jenkins.removeNode(Jenkins.java:1711)
- locked <0x00000000805a7c50> (a hudson.model.Hudson)
at
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at
com.nirima.jenkins.plugins.docker.DockerSlave.retentionTerminate(DockerSlave.java:161)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:32)
- locked <0x000000009a5f52d8> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:678)
at
hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:120)
at
hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:180)
- locked <0x00000000805a7d68> (a java.lang.Object)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1218)
at jenkins.model.Jenkins.setNodes(Jenkins.java:1716)
at jenkins.model.Jenkins.addNode(Jenkins.java:1698)
- locked <0x00000000805a7c50> (a hudson.model.Hudson)
at
com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:131)
at
com.nirima.jenkins.plugins.docker.DockerCloud$1.call(DockerCloud.java:125)
at
jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
"jenkins.util.Timer [#3]":
at jenkins.model.Jenkins.removeNode(Jenkins.java:1705)
- waiting to lock <0x00000000805a7c50> (a hudson.model.Hudson)
at
hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:65)
at
com.nirima.jenkins.plugins.docker.DockerSlave.retentionTerminate(DockerSlave.java:161)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:32)
- locked <0x000000009a5f4a50> (a
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy)
at
com.nirima.jenkins.plugins.docker.DockerRetentionStrategy.check(DockerRetentionStrategy.java:13)
at
hudson.slaves.ComputerRetentionWork.doRun(ComputerRetentionWork.java:66)
at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:54)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Found 1 deadlock.
--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.