-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/31341/
-----------------------------------------------------------
Review request for Ambari, Nate Cole and Tom Beerbower.
Bugs: AMBARI-9761
https://issues.apache.org/jira/browse/AMBARI-9761
Repository: ambari
Description
-------
Another case of misunderstanding how locks work.
During provisioning of a cluster with at least 200 hosts, Ambari Server becomes
unresponsive. Based on the thread dump, there exists a deadlock between:
- Cluster readers
- Cluster writers
- ServiceComponentHost writers
qtp626652285-97 ClusterImpl.convertToResponse() (cluster readLock)
qtp1282624353-47 ServiceComponentHostImpl.setRestartRequired() (sch writeLock)
qtp626652285-97 ServiceComponentHostImpl.getMaintenanceState() (sch readLock BLOCKED by qtp1282624353-47)
qtp1282624353-60 ClusterImpl.recalculateClusterVersionState() (cluster writeLock BLOCKED by qtp626652285-97)
qtp1282624353-47 ServiceComponentHostImpl.isPersisted() (cluster readLock BLOCKED by qtp1282624353-60)
The underlying problem is that once a writeLock.lock() call is parked, all
subsequent readLock.lock() requests park behind it as well. This includes the
request from qtp1282624353-47, which holds a writeLock on the SCH. That
writeLock, in turn, blocks qtp626652285-97, the original cluster readLock
holder that the parked cluster writer is waiting on.
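The parking behaviour described above can be shown in isolation. The following is a minimal sketch, assuming the locks are java.util.concurrent.locks.ReentrantReadWriteLock instances in the default non-fair mode (the description does not say which lock class is used). The class, method and variable names are hypothetical and are not taken from the Ambari code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ParkedWriterDemo {

  /**
   * Returns true if a new reader cannot get the read lock while a writer is
   * parked, even though the lock is currently held only by another reader.
   */
  static boolean newReaderBlockedByParkedWriter() {
    // hypothetical stand-in for the cluster lock
    ReentrantReadWriteLock clusterLock = new ReentrantReadWriteLock();
    boolean[] readerAcquired = new boolean[1];

    Thread writer = new Thread(() -> {
      clusterLock.writeLock().lock();   // parks while the first reader holds the lock
      clusterLock.writeLock().unlock();
    });

    try {
      clusterLock.readLock().lock();    // the calling thread is the original reader
      try {
        writer.start();
        // wait until the writer is queued on the lock and actually parked
        while (!(clusterLock.hasQueuedThread(writer)
            && writer.getState() == Thread.State.WAITING)) {
          Thread.sleep(5);
        }

        Thread newReader = new Thread(() -> {
          try {
            // the timed tryLock honours the queue; it gives up after 200 ms
            readerAcquired[0] =
                clusterLock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
            if (readerAcquired[0]) {
              clusterLock.readLock().unlock();
            }
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
        newReader.start();
        newReader.join();
      } finally {
        clusterLock.readLock().unlock(); // lets the parked writer proceed
      }
      writer.join();
      return !readerAcquired[0];
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("new reader blocked: " + newReaderBlockedByParkedWriter());
  }
}
```

The sketch uses the timed tryLock(long, TimeUnit) on purpose: the untimed tryLock() barges past queued threads, so it would not show the effect. A thread that already holds the read lock can still reacquire it reentrantly; only new readers queue behind the parked writer.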
Long story short, I think we need to revisit locks again after 2.0.0. I just
don't see a need for locking on reads in most places; that's what the database
is doing for us.
Diffs
-----
ambari-server/src/main/java/org/apache/ambari/server/events/listeners/upgrade/StackVersionListener.java
117526c
ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java
0de62ea
ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java
c43044c
ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java
96a1443
Diff: https://reviews.apache.org/r/31341/diff/
Testing
-------
Reproduced the deadlock in a unit test first, and then verified the deadlock
does not occur anymore in the test after applying the patch.
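
For illustration only, here is a sketch of how the three-thread cycle from the description could be reproduced and detected with a timeout. It is not the code in ClusterDeadlockTest. It assumes two ReentrantReadWriteLock instances standing in for the cluster and SCH locks, and every class, method and thread name below is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class ThreeWayDeadlockSketch {

  interface Step {
    void run() throws InterruptedException;
  }

  static Thread start(String name, Step step) {
    Thread thread = new Thread(() -> {
      try {
        step.run();
      } catch (InterruptedException e) {
        // interrupted during cleanup: let the thread exit
      }
    }, name);
    thread.setDaemon(true);
    thread.start();
    return thread;
  }

  static boolean isParkedOn(ReentrantReadWriteLock lock, Thread thread) {
    return lock.hasQueuedThread(thread)
        && thread.getState() == Thread.State.WAITING;
  }

  /** Returns true if all three threads are still stuck after the timeout. */
  static boolean deadlocks(long timeoutMillis) {
    ReentrantReadWriteLock clusterLock = new ReentrantReadWriteLock();
    ReentrantReadWriteLock schLock = new ReentrantReadWriteLock();
    CountDownLatch clusterReadHeld = new CountDownLatch(1);
    CountDownLatch schWriteHeld = new CountDownLatch(1);

    // plays qtp626652285-97: cluster readLock, then SCH readLock
    Thread clusterReader = start("cluster-reader", () -> {
      clusterLock.readLock().lockInterruptibly();
      try {
        clusterReadHeld.countDown();
        schWriteHeld.await();
        schLock.readLock().lockInterruptibly(); // blocked by sch-writer
        schLock.readLock().unlock();
      } finally {
        clusterLock.readLock().unlock();
      }
    });

    // plays qtp1282624353-60: cluster writeLock
    Thread clusterWriter = start("cluster-writer", () -> {
      clusterReadHeld.await();
      clusterLock.writeLock().lockInterruptibly(); // parks behind cluster-reader
      clusterLock.writeLock().unlock();
    });

    // plays qtp1282624353-47: SCH writeLock, then cluster readLock
    Thread schWriter = start("sch-writer", () -> {
      schLock.writeLock().lockInterruptibly();
      try {
        schWriteHeld.countDown();
        while (!isParkedOn(clusterLock, clusterWriter)) {
          Thread.sleep(5);
        }
        clusterLock.readLock().lockInterruptibly(); // parks behind cluster-writer
        clusterLock.readLock().unlock();
      } finally {
        schLock.writeLock().unlock();
      }
    });

    Thread[] threads = {clusterReader, clusterWriter, schWriter};
    try {
      boolean stuck = true;
      for (Thread thread : threads) {
        thread.join(timeoutMillis);
        stuck &= thread.isAlive();
      }
      return stuck;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      for (Thread thread : threads) {
        thread.interrupt(); // unblock the parked threads so they can exit
      }
    }
  }
}
```

A test built this way would assert that deadlocks(...) returns true before the fix and, once the lock ordering is changed, that the same threads finish within the timeout. The sketch uses lockInterruptibly() rather than lock() only so that the stuck threads can be interrupted and cleaned up afterwards.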
Thanks,
Jonathan Hurley