Ladislav Thon created JCLOUDS-1092:
--------------------------------------
Summary: Azure: ComputeService.resumeNode spins in a timeout loop
that doesn't have a chance to exit early
Key: JCLOUDS-1092
URL: https://issues.apache.org/jira/browse/JCLOUDS-1092
Project: jclouds
Issue Type: Bug
Components: jclouds-labs
Affects Versions: 1.9.2
Reporter: Ladislav Thon
This is going to be a slightly longer text, so please bear with me.
Invoking {{ComputeService.resumeNode}} with the Azure provider goes through
these layers:
- {{BaseComputeService.resumeNode}}
- {{AdaptingComputeServiceStrategies.resumeNode}}
- {{AzureComputeServiceAdapter.resumeNode}}
The problem manifests while unwinding the call stack, so let's assume we
got down to {{AzureComputeServiceAdapter.resumeNode}}. Also, the problem only
appears for us when calling {{suspendNode}} and then {{resumeNode}} in rapid
succession, but that's out of jclouds's control.
When the {{trackRequest}} method returns
(https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L383),
it means that the asynchronous operation "start node" succeeded -- but that
doesn't mean that the node is already running. In fact, it's only just starting
-- I was able to confirm that in the debugger by calling
{{api.getDeploymentApiForService(id).get(id)}} and inspecting the
{{roleInstanceList}}.
When we get one layer back up, the
{{AdaptingComputeServiceStrategies.resumeNode}} method calls {{getNode}} (see
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/strategy/impl/AdaptingComputeServiceStrategies.java#L164),
which delegates to {{AzureComputeServiceAdapter.getNode}}.
{{AzureComputeServiceAdapter.getNode}} only returns a non-{{null}} value when
all of the deployment's role instances are in a settled (non-transient) state,
see
https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L269
So when the node is only just starting, {{AzureComputeServiceAdapter.getNode}}
will return {{null}}.
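To make that behaviour concrete, here is a minimal sketch of the pattern. It
is not the actual jclouds-labs code: the class name, the {{InstanceStatus}}
values and the choice of which statuses count as transient are all assumptions
made up for illustration.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

public class GetNodeSketch {
    // Hypothetical statuses; the real Azure role instance statuses differ.
    enum InstanceStatus { READY, STOPPED, STARTING, STOPPING }

    // Assumption: these are the statuses treated as transient.
    static final Set<InstanceStatus> TRANSIENT =
            EnumSet.of(InstanceStatus.STARTING, InstanceStatus.STOPPING);

    /** Returns a node description only if every role instance has settled, else null. */
    static String getNode(String id, List<InstanceStatus> roleInstances) {
        for (InstanceStatus status : roleInstances) {
            if (TRANSIENT.contains(status)) {
                // Still transitioning: the caller cannot tell this apart from "no such node".
                return null;
            }
        }
        return "node:" + id;
    }

    public static void main(String[] args) {
        System.out.println(getNode("vm1", Arrays.asList(InstanceStatus.STARTING)));
        System.out.println(getNode("vm1", Arrays.asList(InstanceStatus.READY)));
    }
}
```

The point of the sketch is only that a node which is "just starting" and a node
which does not exist both come back as {{null}}.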
Again one layer back up: {{AdaptingComputeServiceStrategies.getNode}} returns
{{null}} and hence {{AdaptingComputeServiceStrategies.resumeNode}} also returns
{{null}}.
One more layer back up: {{BaseComputeService.resumeNode}} will call the
{{nodeRunning}} predicate with an {{AtomicReference}} of {{null}}, see
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/internal/BaseComputeService.java#L470
The predicate is a
{{ComputeServiceTimeoutsModule.RetryablePredicateGuardingNull}} which delegates
to {{Predicates2.RetryablePredicate}} and through that to
{{AtomicNodeRunning}}. That is a subclass of
{{RefreshAndDoubleCheckOnFailUnlessStatusInvalid}}, which will always return
{{false}} when the resource is {{null}}, see
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/predicates/internal/RefreshAndDoubleCheckOnFailUnlessStatusInvalid.java#L63
There's also some kind of status refreshing, but that will never happen if the
resource (node, in this case) is {{null}} (there's nothing to refresh).
All in all, the {{Predicates2.RetryablePredicate}} will spin on and on, until
it times out, because for {{null}}, there's no chance it will exit early.
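The effect can be reproduced outside jclouds with a small stand-alone sketch.
Everything in it is hypothetical ({{RetryOnNullSketch}}, {{retry}},
{{nodeRunning}} and the string-valued "node"); it only mimics the shape of a
retrying predicate wrapped around a check that returns {{false}} for a
{{null}} reference, and it is not the {{Predicates2}} implementation.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

public class RetryOnNullSketch {
    // Counts how often the inner check ran, to show the useless spinning.
    static final AtomicInteger attempts = new AtomicInteger();

    // Stand-in for the node-running check: a null reference can never pass,
    // and there is nothing to refresh, so the answer never changes.
    static final Predicate<AtomicReference<String>> nodeRunning = ref -> {
        attempts.incrementAndGet();
        String node = ref.get();
        if (node == null) {
            return false;
        }
        return node.equals("RUNNING");
    };

    /** Re-tests the predicate every periodMillis until it passes or timeoutMillis elapses. */
    static <T> boolean retry(Predicate<T> predicate, T input, long timeoutMillis, long periodMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (predicate.test(input)) {
                return true;
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // timed out; for a null input this is the only way out
    }

    public static void main(String[] args) {
        AtomicReference<String> missing = new AtomicReference<>(null);
        boolean ok = retry(nodeRunning, missing, 200, 10);
        System.out.println("result=" + ok + ", attempts=" + attempts.get());
    }
}
```

With a {{null}} reference the loop always runs for the full timeout and then
returns {{false}}; only the number of attempts varies with the polling period.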
After the timeout, {{BaseComputeService.resumeNode}} prints that resuming the
node was not successful and returns. The problems are:
- the retrying predicate spins uselessly until the timeout
- we actually have no idea about the status of the node when {{resumeNode}}
returns