Aled Sage created BROOKLYN-298:
----------------------------------
Summary: sshj hangs (waiting for shell to finish) after script
completed - maybe VPN went down+up during exec
Key: BROOKLYN-298
URL: https://issues.apache.org/jira/browse/BROOKLYN-298
Project: Brooklyn
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Aled Sage
I was deploying an app whose launch command started docker and pulled an image.
The task hung, showing in the web-console:
{noformat}
In progress - SSH executing, launching VanillaSoftwareProcessImpl{id=nisq2gz4yi}
{noformat}
I believe this is because my VPN disconnected and then reconnected, and our
sshj command keeps waiting for the result - even though the command has
finished executing.
Looking at the target VM, the command has completed (and the script uploaded by
SshjTool has been deleted). There is no evidence of any Brooklyn-initiated
commands executing, according to {{ps aux}}.
Drilling into the activity view in the Brooklyn web-console, the currently
executing thread shows:
{noformat}
SSH executing, launching VanillaSoftwareProcessImpl{id=nisq2gz4yi}
Task[ssh: launching VanillaSoftwareProcessImpl{id=nisq2gz4yi}]@TPnVc8Qs
Submitted by SoftlyPresent[value=Task[launch (main)]@mvL4OvdH]
In progress, thread waiting (timed) on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@408df99d
At: net.schmizz.concurrent.Promise.tryRetrieve(Promise.java:168)
net.schmizz.concurrent.Promise.retrieve(Promise.java:137)
net.schmizz.concurrent.Event.await(Event.java:103)
net.schmizz.sshj.connection.channel.AbstractChannel.join(AbstractChannel.java:282)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool$ShellAction.create(SshjTool.java:1012)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool$ShellAction.create(SshjTool.java:925)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool.acquire(SshjTool.java:630)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool.acquire(SshjTool.java:616)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool$1.run(SshjTool.java:331)
org.apache.brooklyn.util.core.internal.ssh.sshj.SshjTool.execScript(SshjTool.java:326)
org.apache.brooklyn.util.core.task.system.internal.ExecWithLoggingHelpers$1.exec(ExecWithLoggingHelpers.java:82)
org.apache.brooklyn.util.core.task.system.internal.ExecWithLoggingHelpers$3.apply(ExecWithLoggingHelpers.java:166)
org.apache.brooklyn.util.core.task.system.internal.ExecWithLoggingHelpers$3.apply(ExecWithLoggingHelpers.java:164)
org.apache.brooklyn.util.pool.BasicPool.exec(BasicPool.java:146)
org.apache.brooklyn.location.ssh.SshMachineLocation.execSsh(SshMachineLocation.java:611)
org.apache.brooklyn.location.ssh.SshMachineLocation$13.execWithTool(SshMachineLocation.java:790)
org.apache.brooklyn.util.core.task.system.internal.ExecWithLoggingHelpers.execWithLogging(ExecWithLoggingHelpers.java:164)
org.apache.brooklyn.util.core.task.system.internal.ExecWithLoggingHelpers.execScript(ExecWithLoggingHelpers.java:80)
org.apache.brooklyn.location.ssh.SshMachineLocation.execScript(SshMachineLocation.java:774)
org.apache.brooklyn.entity.software.base.AbstractSoftwareProcessSshDriver.execute(AbstractSoftwareProcessSshDriver.java:272)
org.apache.brooklyn.entity.software.base.lifecycle.ScriptHelper.executeInternal(ScriptHelper.java:366)
org.apache.brooklyn.entity.software.base.lifecycle.ScriptHelper$8.call(ScriptHelper.java:287)
org.apache.brooklyn.entity.software.base.lifecycle.ScriptHelper$8.call(ScriptHelper.java:285)
org.apache.brooklyn.util.core.task.DynamicSequentialTask$DstJob.call(DynamicSequentialTask.java:359)
org.apache.brooklyn.util.core.task.BasicExecutionManager$SubmissionCallable.call(BasicExecutionManager.java:519)
{noformat}
Running {{netstat -antp TCP}} on my local machine, I still see an established
ssh connection:
{noformat}
tcp4 0 0 10.104.3.10.54535 10.104.1.193.22 ESTABLISHED
{noformat}
I do *not* see a corresponding entry when I run {{sudo netstat -anp}} on the
target VM.
---
Looking in the Brooklyn code at {{SshjTool$ShellAction.create}}, I wonder what
else we could call on sshj to check if our connection is ok and/or the command
has actually completed. We are already calling {{shell.isOpen()}} and
{{session.getExitStatus()!=null}}. We could add calls to {{session.isOpen()}},
{{session.getExitSignal()}} and/or {{session.getExitWasCoreDumped()}}.
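As a rough illustration of what such extra checks might look like, here is a minimal sketch of a bounded polling loop. It is not the SshjTool code and makes no sshj calls: {{BoundedShellWait}}, {{Probes}}, {{Outcome}} and {{awaitCompletion}} are hypothetical names, and the {{connectionAlive}} probe in particular is an assumption (I have not confirmed which sshj call, if any, would notice a connection that is only dead on the remote side). The idea is simply to give up after an overall deadline, or as soon as any probe says the channel or connection is gone, instead of waiting indefinitely.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Hypothetical helper: waits for a remote command, but never without a bound. */
final class BoundedShellWait {

    /** Why the wait ended. */
    public enum Outcome { COMPLETED, CHANNEL_CLOSED, CONNECTION_LOST, TIMED_OUT, INTERRUPTED }

    /**
     * Hypothetical view of the state being polled. Plain suppliers are used so
     * that the sketch compiles without sshj on the classpath.
     */
    public static final class Probes {
        final Supplier<Integer> exitStatus;    // null until an exit status has been reported
        final BooleanSupplier channelOpen;     // false once the shell/session channel is closed
        final BooleanSupplier connectionAlive; // assumed liveness check on the underlying connection

        public Probes(Supplier<Integer> exitStatus,
                      BooleanSupplier channelOpen,
                      BooleanSupplier connectionAlive) {
            this.exitStatus = exitStatus;
            this.channelOpen = channelOpen;
            this.connectionAlive = connectionAlive;
        }
    }

    /** Polls the probes until one gives a definite answer or the deadline passes. */
    public static Outcome awaitCompletion(Probes p, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (p.exitStatus.get() != null) return Outcome.COMPLETED;
            if (!p.channelOpen.getAsBoolean()) return Outcome.CHANNEL_CLOSED;
            if (!p.connectionAlive.getAsBoolean()) return Outcome.CONNECTION_LOST;

            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) return Outcome.TIMED_OUT;

            // Sleep for the poll interval, but at least 1 ms and never past the deadline.
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(remainingNanos);
            long sleepMillis = Math.max(1L, Math.min(pollMillis, remainingMillis));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return Outcome.INTERRUPTED;
            }
        }
    }
}
```

In Brooklyn the first two probes could presumably be wired to the calls mentioned above, for example {{() -> session.getExitStatus()}} and {{() -> shell.isOpen()}}. A caller would then treat anything other than {{COMPLETED}} as a failed or abandoned execution rather than blocking the task forever.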