-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30357/
-----------------------------------------------------------
Review request for Ambari, Alejandro Fernandez, Mahadev Konar, Nate Cole, and
Tom Beerbower.
Bugs: AMBARI-9368
https://issues.apache.org/jira/browse/AMBARI-9368
Repository: ambari
Description
-------
A deadlock scenario was identified between ServiceComponentImpl and
ServiceComponentHostImpl. The root cause was the pattern of locking that is
used in ClusterImpl, ServiceImpl, ServiceComponentImpl and
ServiceComponentHostImpl. Essentially, our pattern of using 2 locks per
implementation is, in most cases, unnecessary and deadlock prone.
ClusterImpl:
I've gone through and removed the internal lock on the cluster. The cluster is
the highest level of logical organization and should only have 1 lock (hence
the reason for a cluster global lock). It made no sense to keep another
internal lock that either locked in parallel or contradicted the global lock.
Furthermore, methods on primary keys such as clusterId don't even need to lock
since their values won't change.
ServiceImpl:
Locks around static data or primary key data such as service names, cluster
IDs, etc, should not have locks around them since they will never change. The
clusterGlobalLock was removed from many methods where the JPA entity of the
service was being written to; a cluster lock should not be used unless the
state of the cluster is changing or being retrieved.
ServiceComponentImpl:
Locks around static data or primary key data such as service names, cluster
IDs, etc, should not have locks around them since they will never change. In
many cases, the clusterGlobalLock was removed for the same reasons as stated in
ServiceImpl.
ServiceComponentHostImpl
Locks around static data or primary key data such as service names, cluster
IDs, etc, should not have locks around them since they will never change. In
many cases, the clusterGlobalLock was removed for the same reasons as stated in
ServiceImpl.
Overall, the global cluster lock was left on methods that alter the topology of
the cluster (add/delete of a service/component/host) or when refreshing the
internal entity. The cluster global lock, however, once the cluster topology
has changed. The internal locks were left in place where there was work being
done around a non-thread-safe member or when an internal entity was having
non-primary-key data set.
Diffs
-----
ambari-server/src/main/java/org/apache/ambari/server/state/ServiceComponentImpl.java
a31e42c
ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java
85e8314
ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
0bfaa19
ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java
dbdef2b
ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java
PRE-CREATION
Diff: https://reviews.apache.org/r/30357/diff/
Testing
-------
Installed a small (5 node) and large (20 node) cluster and performed an
end-to-end install with all services. In both cases, I had two browsers opened
so there were interwoven requests during installations.
In the small cluster, I forced a yum repository error on install so that I
would need to retry the failed install. The cluster deleted and the services
were repopulated on the retry.
After both clusters were installed, I execute an Add Service from the browser
at the same time as a DELETE service request from a 2nd browser. In both cases,
the requests were scheduled correctly and the services were propagated.
Testing also included adding new hosts, and adding new host components from 2
different browers simultaneously.
New tests were added to cover threads performing read serialization on the
impl's concurrently with impl creation and writes. These tests deadlocked
before the changes.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29:21 min
[INFO] Finished at: 2015-01-28T01:36:06-05:00
[INFO] Final Memory: 36M/443M
[INFO] ------------------------------------------------------------------------
Thanks,
Jonathan Hurley