-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34677/
-----------------------------------------------------------

(Updated May 26, 2015, 7:42 p.m.)


Review request for Ambari, Robert Nettleton and Tom Beerbower.


Bugs: AMBARI-11394
    https://issues.apache.org/jira/browse/AMBARI-11394


Repository: ambari


Description
-------

Provisioning a cluster may occasionally fail to complete as a result of an 
out-of-order database write.
This error presents itself as start task(s) that never progress beyond the 
PENDING state. For these logical pending tasks, there are no associated 
physical tasks.
When a host is matched to a host request, an install request is submitted, 
followed immediately by a start request. The install task transitions the 
desired_state of all host components on the host from INIT to INSTALLED. But, 
because of an error in the persistence layer, after the desired_state is set to 
INSTALLED, it is overwritten on another thread (the heartbeat handler thread) 
back to INIT. As a result, the component is never started, because its desired 
state is INIT and so it isn't processed by the start operation.
The root cause is that the public method 
ServiceComponentHostImpl.handleEvent() is annotated with '@Transactional'. 
Inside this method the proper locks are acquired, BUT because the method is 
marked @Transactional, its invocation is wrapped in a proxy that runs it in a 
transaction. As a result, the transaction is committed in the proxy after the 
method returns, outside of any synchronization, which allows out-of-order 
writes.
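To make the hazard concrete, here is a minimal, self-contained sketch of the 
interleaving described above. The class, latch names, and the explicit 
"commit" writes are all hypothetical stand-ins (not from the patch): the 
latches just force the same ordering that the @Transactional proxy permits, 
where each thread's commit lands after its locked section has already 
finished.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of the bug: the "commit" (the write to the shared
// desired_state) happens after each method returns, outside the lock, so
// commits can land in a different order than the locked computations ran.
public class CommitOrderingDemo {
    // Stands in for the persisted desired_state column.
    static final AtomicReference<String> desiredState = new AtomicReference<>("INIT");

    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        final CountDownLatch installReturned = new CountDownLatch(1);
        final CountDownLatch heartbeatReturned = new CountDownLatch(1);
        final CountDownLatch installCommitted = new CountDownLatch(1);

        // Models handleEvent() for the install: the new state is decided under
        // the lock, but the transactional proxy commits only after return.
        Thread install = new Thread(() -> {
            String newState;
            synchronized (lock) {
                newState = "INSTALLED";
            }
            installReturned.countDown();
            await(heartbeatReturned);       // force the bad interleaving
            desiredState.set(newState);     // "commit", outside the lock
            installCommitted.countDown();
        });

        // Models the heartbeat handler: it reads the still-uncommitted INIT
        // under the lock, then its own late "commit" overwrites INSTALLED.
        Thread heartbeat = new Thread(() -> {
            await(installReturned);
            String observed;
            synchronized (lock) {
                observed = desiredState.get(); // sees stale "INIT"
            }
            heartbeatReturned.countDown();
            await(installCommitted);
            desiredState.set(observed);        // stale write clobbers INSTALLED
        });

        install.start();
        heartbeat.start();
        install.join();
        heartbeat.join();
        System.out.println("final desired_state = " + desiredState.get());
    }

    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Even though every read and write of desiredState inside a synchronized block 
is properly ordered, the late commits are not, and the stale INIT wins; this 
is why moving the transaction boundary inside the lock scope (rather than 
leaving it on the proxied public method) closes the window.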


Diffs
-----

  ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java dd06eb5 

Diff: https://reviews.apache.org/r/34677/diff/


Testing (updated)
-------

- provisioned clusters via BP
- currently re-running unit test suite and will update with results prior to 
merging

Because this is a timing issue which, according to a user, occurs for them only 
about once every ~150 clusters, and I have been unable to reproduce it, I 
wasn't able to verify that this patch completely fixes the issue. But I can 
say with certainty that the issue that was fixed could manifest itself 
precisely as the bug describes.


Thanks,

John Speidel
