-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/
-----------------------------------------------------------

Review request for Ambari, Nate Cole and Tom Beerbower.


Bugs: AMBARI-9761
    https://issues.apache.org/jira/browse/AMBARI-9761


Repository: ambari


Description
-------

Another case of misunderstanding how locks work.

While provisioning a cluster with at least 200 hosts, Ambari Server becomes 
unresponsive. The thread dump shows a deadlock among:
- Cluster readers
- Cluster writers
- ServiceComponentHost writers

qtp626652285-97   ClusterImpl.convertToResponse() (cluster readLock)
qtp1282624353-47  ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
qtp626652285-97   ServiceComponentHostImpl.getMaintenanceState()
                  (sch readLock BLOCKED by qtp1282624353-47)
qtp1282624353-60  ClusterImpl.recalculateClusterVersionState()
                  (cluster writeLock BLOCKED by qtp626652285-97)
qtp1282624353-47  ServiceComponentHostImpl.isPersisted()
                  (cluster readLock BLOCKED by qtp1282624353-60)

The underlying problem is that once a writeLock.lock() call is parked, all 
subsequent readLock.lock() requests from other threads park behind it. This 
includes the request from qtp1282624353-47, which holds the writeLock on the 
SCH and, in turn, blocks qtp626652285-97 (the original cluster readLock 
holder, which is what blocks the cluster writer in the first place).
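
The parking behavior described above can be seen in isolation with a small 
sketch. It assumes the locks are java.util.concurrent.locks.ReentrantReadWriteLock 
instances with the default non-fair policy, which this review does not state 
explicitly; the class and method names are hypothetical and not part of the 
patch.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; not part of Ambari.
public class ParkedWriterBlocksReaders {

    /** Returns true if a new reader got the read lock while a writer was parked. */
    static boolean newReaderGetsLockBehindParkedWriter() throws InterruptedException {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final CountDownLatch readerHolds = new CountDownLatch(1);
        final CountDownLatch release = new CountDownLatch(1);

        // First reader: takes the read lock and holds it until released.
        Thread firstReader = new Thread(() -> {
            lock.readLock().lock();
            try {
                readerHolds.countDown();
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.readLock().unlock();
            }
        });
        firstReader.start();
        readerHolds.await();

        // Writer: parks in writeLock().lock() because a reader holds the lock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            lock.writeLock().unlock();
        });
        writer.start();
        while (!lock.hasQueuedThread(writer)) {
            Thread.sleep(1); // wait until the writer is actually queued
        }

        // New reader (this thread): only a read lock is held by someone else,
        // yet the timed attempt queues behind the parked writer.
        boolean acquired = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
        if (acquired) {
            lock.readLock().unlock();
        }

        release.countDown(); // let the first reader go so everything drains
        firstReader.join();
        writer.join();
        return acquired;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("new reader acquired: "
                + newReaderGetsLockBehindParkedWriter());
    }
}
```

Under those assumptions the new reader is expected to time out, even though 
no thread holds the write lock at that moment. The timed tryLock is used 
only so the demo terminates; a plain readLock().lock() would park 
indefinitely in the same position.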

Long story short, I think we need to revisit locking again after 2.0.0. I 
just don't see a need for locking on reads in most places; the database 
already does that for us.


Diffs
-----

  
ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java
 117526c 
  ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 
0de62ea 
  
ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java
 c43044c 
  
ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java
 96a1443 

Diff: https://reviews.apache.org/r/31341/diff/


Testing
-------

Reproduced the deadlock in a unit test first, and then verified the deadlock 
does not occur anymore in the test after applying the patch.
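
For readers who want to see the cycle without a 200-host cluster, here is a 
rough model of the three threads from the thread dump. It is not the code in 
ClusterDeadlockTest; the two locks are plain ReentrantReadWriteLock stand-ins 
(an assumption), all names are hypothetical, and the second acquisitions use 
timed tryLock so the run ends instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical model of the lock cycle; not the actual Ambari test.
public class LockCycleSketch {

    /** Returns true when neither second acquisition succeeds, i.e. the cycle forms. */
    static boolean cycleForms() throws InterruptedException {
        final ReentrantReadWriteLock cluster = new ReentrantReadWriteLock(); // stand-in for the cluster lock
        final ReentrantReadWriteLock sch = new ReentrantReadWriteLock();     // stand-in for the SCH lock
        final CountDownLatch firstLocksHeld = new CountDownLatch(2);
        final CountDownLatch writerParked = new CountDownLatch(1);
        final CountDownLatch attemptsDone = new CountDownLatch(2);
        final AtomicBoolean readerGotSch = new AtomicBoolean();
        final AtomicBoolean schWriterGotCluster = new AtomicBoolean();

        // Models qtp626652285-97: cluster readLock, then sch readLock.
        Thread clusterReader = new Thread(() -> {
            cluster.readLock().lock();
            try {
                firstLocksHeld.countDown();
                writerParked.await();
                if (sch.readLock().tryLock(300, TimeUnit.MILLISECONDS)) {
                    readerGotSch.set(true);
                    sch.readLock().unlock();
                }
                attemptsDone.countDown();
                attemptsDone.await(); // keep the first lock until both attempts finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                cluster.readLock().unlock();
            }
        });

        // Models qtp1282624353-47: sch writeLock, then cluster readLock.
        Thread schWriter = new Thread(() -> {
            sch.writeLock().lock();
            try {
                firstLocksHeld.countDown();
                writerParked.await();
                if (cluster.readLock().tryLock(300, TimeUnit.MILLISECONDS)) {
                    schWriterGotCluster.set(true);
                    cluster.readLock().unlock();
                }
                attemptsDone.countDown();
                attemptsDone.await(); // keep the first lock until both attempts finish
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                sch.writeLock().unlock();
            }
        });

        // Models qtp1282624353-60: cluster writeLock, parked behind the reader.
        Thread clusterWriter = new Thread(() -> {
            cluster.writeLock().lock();
            cluster.writeLock().unlock();
        });

        clusterReader.start();
        schWriter.start();
        firstLocksHeld.await();
        clusterWriter.start();
        while (!cluster.hasQueuedThread(clusterWriter)) {
            Thread.sleep(1); // wait until the cluster writer is queued
        }
        writerParked.countDown();

        clusterReader.join();
        schWriter.join();
        clusterWriter.join();
        return !readerGotSch.get() && !schWriterGotCluster.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("cycle forms: " + cycleForms());
    }
}
```

With untimed lock() calls in place of tryLock, the same ordering would leave 
all three threads parked, which is the state the thread dump shows.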


Thanks,

Jonathan Hurley
