-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/
-----------------------------------------------------------

(Updated Aug. 11, 2015, 3:30 p.m.)


Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, Robert 
Levas, and Sumit Mohanty.


Bugs: AMBARI-12720
    https://issues.apache.org/jira/browse/AMBARI-12720


Repository: ambari


Description
-------

This patch provides a fix for AMBARI-12720. 

Blueprint Cluster deployments are failing under load, with clusters of 50+ 
nodes.  One or more hosts in the cluster fail to be registered, even though the 
ambari-agent registration occurs properly for these hosts.  This problem seems 
to occur when certain hosts are registered at roughly the same time as the call 
to TopologyManager.processRequest().  

Upon further analysis, it appears that the TopologyManager.outstandingRequests 
object is not fully sychronized.  

There are two paths involved in reading/writing to this data structure:

1. TopologyManager.processRequest() (occurs on the main request thread handling 
the cluster creation request).  When all known hosts have been processed, but 
some requests are pending, the LogicalRequest instance is added to 
TopologyManager.outstandingRequests.  Note:  The call to add this request is 
not synchronized on the outstandingRequests object, but is only synchronized on 
TopologyManager.availableHosts. 
2. TopologyManager.onHostRegistered() (occurs on a separate thread that handles 
ambari-agent registrations for new hosts joining the cluster).  This method 
does synchronize on TopologyManager.outstandingRequests, and attempts to 
iterate over these requests to determine if the newly registered host can 
satisfy the request.  

The incorrect synchronization on TopologyManager.outstandingRequests appears to 
be the root of the intermittent cluster deployment problems described in 
AMBARI-12720.  Since this datastructure is read/updated by both threads, and 
since this object is not protected by the same locks, there appears to be a 
race condition in this code.  

To fix this problem, this patch implements the following:

1. Updates the TopologyManager.onHostRegistered() method to lock on
        --> availableHosts
            --> outstandingRequests
   prior to reading TopologyManager.outstandingRequests
   
2. Updates the TopologyManager.processRequest() method to lock on
        --> availableHosts
            --> outstandingRequests
   prior to adding a LogicalRequest instance to this collection
   

The locking order (availableHosts, outstandingRequests) must be kept in place 
in order to avoid any potential deadlocks from making a change like this.


Diffs
-----

  
ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
 8e7fb7f 

Diff: https://reviews.apache.org/r/37356/diff/


Testing (updated)
-------

1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are 
passing.
2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this change 
does not break existing deployments.
3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make 
sure this change does not break deployments of larger clusters.


Thanks,

Robert Nettleton

Reply via email to