> On Aug. 11, 2015, 5:04 p.m., Sumit Mohanty wrote:
> > ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java,
> >  line 373
> > <https://reviews.apache.org/r/37356/diff/1/?file=1037718#file1037718line373>
> >
> >     Is this necessary if availableHosts is what we are using to synchronize 
> > both the calls?
> 
> Robert Nettleton wrote:
>     Please see my comment above for more context.
>     
>     As I understand it, both locks here are necessary. 
>     
>     Previously, only one of the methods would lock on the availablehosts, and 
> then directly access the outstandingRequests.  
>     
>     Locking on both objects seems to be the safest, and most straightforward 
> approach, but I'm happy to entertain other approaches if we think they are 
> necessary.  
>     
>     Thanks.
> 
> Sumit Mohanty wrote:
>     Agree, locking both should be OK - just wanted to ensure that we are not 
> locking unnecessarily. This is my understanding - synchronizing on 
> availableHosts will ensure that onHostRegistered() cannot proceed till 
> processRequest() creates outstanding requests. Is that a valid understanding?

I believe you are correct.  Synchronizing on availableHosts in 
onHostRegistered(), rather than outstandingRequests, could potentially be a fix 
as well.  

My concern was that I did not want to disturb the existing locks too much to 
fix this bug.  Since one of the methods in question locked on availableHosts, 
and one locked on outstandingRequests, it seemed safest to use both, and order 
them appropriately.


- Robert


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37356/#review94934
-----------------------------------------------------------


On Aug. 11, 2015, 3:30 p.m., Robert Nettleton wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37356/
> -----------------------------------------------------------
> 
> (Updated Aug. 11, 2015, 3:30 p.m.)
> 
> 
> Review request for Ambari, Jonathan Hurley, John Speidel, Mahadev Konar, 
> Robert Levas, and Sumit Mohanty.
> 
> 
> Bugs: AMBARI-12720
>     https://issues.apache.org/jira/browse/AMBARI-12720
> 
> 
> Repository: ambari
> 
> 
> Description
> -------
> 
> This patch provides a fix for AMBARI-12720. 
> 
> Blueprint Cluster deployments are failing under load, with clusters of 50+ 
> nodes.  One or more hosts in the cluster fail to be registered, even though 
> the ambari-agent registration occurs properly for these hosts.  This problem 
> seems to occur when certain hosts are registered at roughly the same time as 
> the call to TopologyManager.processRequest().  
> 
> Upon further analysis, it appears that the 
> TopologyManager.outstandingRequests object is not fully sychronized.  
> 
> There are two paths involved in reading/writing to this data structure:
> 
> 1. TopologyManager.processRequest() (occurs on the main request thread 
> handling the cluster creation request).  When all known hosts have been 
> processed, but some requests are pending, the LogicalRequest instance is 
> added to TopologyManager.outstandingRequests.  Note:  The call to add this 
> request is not synchronized on the outstandingRequests object, but is only 
> synchronized on TopologyManager.availableHosts. 
> 2. TopologyManager.onHostRegistered() (occurs on a separate thread that 
> handles ambari-agent registrations for new hosts joining the cluster).  This 
> method does synchronize on TopologyManager.outstandingRequests, and attempts 
> to iterate over these requests to determine if the newly registered host can 
> satisfy the request.  
> 
> The incorrect synchronization on TopologyManager.outstandingRequests appears 
> to be the root of the intermittent cluster deployment problems described in 
> AMBARI-12720.  Since this datastructure is read/updated by both threads, and 
> since this object is not protected by the same locks, there appears to be a 
> race condition in this code.  
> 
> To fix this problem, this patch implements the following:
> 
> 1. Updates the TopologyManager.onHostRegistered() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to reading TopologyManager.outstandingRequests
>    
> 2. Updates the TopologyManager.processRequest() method to lock on
>         --> availableHosts
>             --> outstandingRequests
>    prior to adding a LogicalRequest instance to this collection
>    
> 
> The locking order (availableHosts, outstandingRequests) must be kept in place 
> in order to avoid any potential deadlocks from making a change like this.
> 
> 
> Diffs
> -----
> 
>   
> ambari-server/src/main/java/org/apache/ambari/server/topology/TopologyManager.java
>  8e7fb7f 
> 
> Diff: https://reviews.apache.org/r/37356/diff/
> 
> 
> Testing
> -------
> 
> 1. Ran the ambari-server unit test suite ("mvn clean test").  All tests are 
> passing.
> 2. Deployed a 3-node HDFS HA cluster via a Blueprint, to make sure this 
> change does not break existing deployments.
> 3. Deployed a 45-node HDFS HA cluster via a Blueprint on the cloud, to make 
> sure this change does not break deployments of larger clusters.
> 
> 
> Thanks,
> 
> Robert Nettleton
> 
>

Reply via email to