Ladislav Thon created JCLOUDS-1092:
--------------------------------------

             Summary: Azure: ComputeService.resumeNode spins in a timeout loop 
that doesn't have a chance to exit early
                 Key: JCLOUDS-1092
                 URL: https://issues.apache.org/jira/browse/JCLOUDS-1092
             Project: jclouds
          Issue Type: Bug
          Components: jclouds-labs
    Affects Versions: 1.9.2
            Reporter: Ladislav Thon


This is going to be a slightly longer text, so please bear with me.

Invoking {{ComputeService.resumeNode}} with the Azure provider goes through 
these layers:

- {{BaseComputeService.resumeNode}}
- {{AdaptingComputeServiceStrategies.resumeNode}}
- {{AzureComputeServiceAdapter.resumeNode}}

The problem manifests when traversing the call stack back up, so let's assume we 
got down to {{AzureComputeServiceAdapter.resumeNode}}. Also, the problem only 
appears for us when calling {{suspendNode}} and then {{resumeNode}} in rapid 
succession, but that's outside of jclouds's control.

When the {{trackRequest}} method returns 
(https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L383),
 it means that the asynchronous operation "start node" succeeded -- but that 
doesn't mean that the node is already running. In fact, it's only just starting 
-- I was able to confirm that in the debugger by calling 
{{api.getDeploymentApiForService(id).get(id)}} and inspecting the 
{{roleInstanceList}}.

When we get one layer back up, the 
{{AdaptingComputeServiceStrategies.resumeNode}} method calls {{getNode}} (see 
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/strategy/impl/AdaptingComputeServiceStrategies.java#L164),
 which delegates to {{AzureComputeServiceAdapter.getNode}}.

{{AzureComputeServiceAdapter.getNode}} only returns a non-{{null}} value when all 
of the deployment's role instances are in a settled (non-transient) state, see 
https://github.com/jclouds/jclouds-labs/blob/fe24698d81/azurecompute/src/main/java/org/jclouds/azurecompute/compute/AzureComputeServiceAdapter.java#L269
So while the node is only just starting, {{AzureComputeServiceAdapter.getNode}} 
returns {{null}}.
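
To make that behaviour concrete, here is a minimal sketch of a "return {{null}} while anything is transient" check. It is not the actual adapter code: the {{InstanceStatus}} enum, its values, and the {{String}} return type are hypothetical stand-ins chosen only to illustrate the rule described above.

```java
import java.util.Arrays;
import java.util.List;

public class TransientStateSketch {
    // Hypothetical stand-in for role instance statuses (assumption, not the real enum).
    enum InstanceStatus {
        READY_ROLE(false), STOPPED_VM(false), STARTING_VM(true), STOPPING_VM(true);

        final boolean transientState;

        InstanceStatus(boolean transientState) {
            this.transientState = transientState;
        }
    }

    // Returns a node description, or null while any role instance is still transient.
    static String getNode(String id, List<InstanceStatus> roleInstances) {
        for (InstanceStatus status : roleInstances) {
            if (status.transientState) {
                return null; // deployment has not settled yet
            }
        }
        return "node:" + id;
    }

    public static void main(String[] args) {
        // Right after the "start node" operation succeeds, the instance is still starting.
        System.out.println(getNode("vm1", Arrays.asList(InstanceStatus.STARTING_VM)));
        // Once the instance has settled, a node is returned.
        System.out.println(getNode("vm1", Arrays.asList(InstanceStatus.READY_ROLE)));
    }
}
```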

Again one layer back up: {{AdaptingComputeServiceStrategies.getNode}} returns 
{{null}} and hence {{AdaptingComputeServiceStrategies.resumeNode}} also returns 
{{null}}.

One more layer back up: {{BaseComputeService.resumeNode}} will call the 
{{nodeRunning}} predicate with an {{AtomicReference}} of {{null}}, see 
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/internal/BaseComputeService.java#L470
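
The way the {{null}} travels back up through the three layers can be sketched roughly as follows. All method names and the {{String}} node type here are hypothetical simplifications, and the {{settled}} flag is an assumption standing in for the state of the deployment's role instances.

```java
import java.util.concurrent.atomic.AtomicReference;

public class NullPropagationSketch {
    // Hypothetical adapter layer: no node is reported while the deployment is settling.
    static String adapterGetNode(String id, boolean settled) {
        return settled ? "node:" + id : null;
    }

    // Hypothetical strategy layer: after the start request, re-read the node.
    static String strategyResumeNode(String id, boolean settled) {
        // (the asynchronous "start node" request itself is omitted here)
        return adapterGetNode(id, settled);
    }

    // Hypothetical service layer: wraps whatever came back, null included.
    static AtomicReference<String> serviceResumeNode(String id, boolean settled) {
        return new AtomicReference<String>(strategyResumeNode(id, settled));
    }

    public static void main(String[] args) {
        // Node still starting: the reference handed to the predicate holds null.
        System.out.println(serviceResumeNode("vm1", false).get());
        // Node settled: the reference holds an actual node.
        System.out.println(serviceResumeNode("vm1", true).get());
    }
}
```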

The predicate is a 
{{ComputeServiceTimeoutsModule.RetryablePredicateGuardingNull}} which delegates 
to {{Predicates2.RetryablePredicate}} and through that to 
{{AtomicNodeRunning}}. That is a subclass of 
{{RefreshAndDoubleCheckOnFailUnlessStatusInvalid}}, which will always return 
{{false}} when the resource is {{null}}, see 
https://github.com/jclouds/jclouds/blob/b9322c583d/compute/src/main/java/org/jclouds/compute/predicates/internal/RefreshAndDoubleCheckOnFailUnlessStatusInvalid.java#L63
The predicate can also refresh the resource's status, but that never happens 
when the resource (the node, in this case) is {{null}}: there's nothing to refresh.

All in all, the {{Predicates2.RetryablePredicate}} keeps retrying until 
it times out, because for a {{null}} node there's no way for it to exit early.
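
The effect can be reproduced with a small self-contained sketch. This is not the jclouds implementation: {{retry}} is a simplified stand-in for the retrying predicate, {{nodeRunning}} is a hypothetical check that (like the real one, as described above) returns {{false}} for {{null}}, and the timeout and period values are arbitrary.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Predicate;

public class RetryOnNullSketch {
    // Number of times the inner check has been evaluated (for illustration only).
    static int attempts = 0;

    // Hypothetical stand-in for the node-running check: a null node is never "running".
    static boolean nodeRunning(AtomicReference<String> node) {
        attempts++;
        String status = node.get();
        return status != null && status.equals("RUNNING");
    }

    // Simplified retry loop: re-test every periodMillis until success or timeout.
    static <T> boolean retry(Predicate<T> check, T input, long timeoutMillis, long periodMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (check.test(input)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(periodMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Convenience entry point: retry against a reference that holds null.
    static boolean waitForNullNode(long timeoutMillis, long periodMillis) {
        return retry(RetryOnNullSketch::nodeRunning,
                new AtomicReference<String>(null), timeoutMillis, periodMillis);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        boolean running = waitForNullNode(300, 50);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("running=" + running
                + " attempts=" + attempts + " elapsedMs=" + elapsedMillis);
    }
}
```

Since nothing ever replaces the {{null}} inside the reference, every attempt fails in the same way and the loop can only end through the timeout.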

After the timeout, {{BaseComputeService.resumeNode}} reports that resuming the 
node was not successful and returns. The problems are:

- the retrying predicate is spinning uselessly
- we actually have no idea about the status of the node when {{resumeNode}} 
returns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
