> On Aug. 11, 2015, 8:24 p.m., John Speidel wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java,
> >  line 169
> > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line169>
> >
> >     Just got done discussing with Bob on a call.
> >     
> >     This synchronized block should be moved up to line 153 and end at the 
> > bottom of the method.  The existing availbleHosts synchronization block on 
> > line 195 should be removed due to the existence of the new availableHosts 
> > syncronization described above.  
> >     
> >     The important thing here is that we synchronize on a common mutual 
> > exclusion monitor in both onHostRegistered and processRequest.  With this 
> > patch it will be the availableHosts monitor.
> >     
> >     I agree with the other comments that this is dangerous any likely 
> > inefficient and also agree that we should change as little as possible 
> > initially while still fixing the issue.  If this patch causes a significant 
> > performance degradation then we can then look at optimizing the locking in 
> > this class in another patch.

Version 2 of the patch addresses this issue.


- Robert


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/#review94991
-----------------------------------------------------------


On Aug. 12, 2015, 9:54 p.m., Robert Nettleton wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37356/
> -----------------------------------------------------------
> 
> (Updated Aug. 12, 2015, 9:54 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, 
> Robert Levas, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-12720
>     https://issues.apache.org/jira/browse/AMBARI-12720
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This patch provides a fix for AMBARI-12720. 
> 
> Blueprint Cluster deployments are failing under load, with clusters of 50+ 
> nodes.  One or more hosts in the cluster fail to be registered, even though 
> the ambari-agent registration occurs properly for these hosts.  This problem 
> seems to occur when certain hosts are registered at roughly the same time as 
> the call to TopologyManager.processRequest().  
> 
> Upon further analysis, it appears that the 
> TopologyManager.outstandingRequests object is not fully sychronized.  
> 
> There are two paths involved in reading/writing to this data structure:
> 
> 1. TopologyManager.processRequest() (occurs on the main request thread 
> handling the cluster creation request).  When all known hosts have been 
> processed, but some requests are pending, the LogicalRequest instance is 
> added to TopologyManager.outstandingRequests.  Note:  The call to add this 
> request is not synchronized on the outstandingRequests object, but is only 
> synchronized on TopologyManager.availableHosts. 
> 2. TopologyManager.onHostRegistered() (occurs on a separate thread that 
> handles ambari-agent registrations for new hosts joining the cluster).  This 
> method does synchronize on TopologyManager.outstandingRequests, and attempts 
> to iterate over these requests to determine if the newly registered host can 
> satisfy the request.  
> 
> The incorrect synchronization on TopologyManager.outstandingRequests appears 
> to be the root of the intermittent cluster deployment problems described in 
> AMBARI-12720.  Since this datastructure is read/updated by both threads, and 
> since this object is not protected by the same locks, there appears to be a 
> race condition in this code.  
> 
> To fix this problem, this patch implements the following:
> 
> 1. Updates the TopologyManager.onHostRegistered() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to reading TopologyManager.outstandingRequests
>    
> 2. Updates the TopologyManager.processRequest() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to adding a LogicalRequest instance to this collection
>    
> 
> The locking order (availableHosts, outstandingRequests) must be kept in place 
> in order to avoid any potential deadlocks from making a change like this.
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
>  4f51d47 
> 
> Diff: https://reviews.apache.org/r/37356/diff/
> 
> 
> Testing
> -------
> 
> 1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are 
> passing.
> 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this 
> change does not break existing deployments.
> 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make 
> sure this change does not break deployments of larger clusters.
> 
> 
> Thanks,
> 
> Robert Nettleton
> 
>

Reply via email to