-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/30357/
-----------------------------------------------------------

Review request for Ambari, Alejandro Fernandez, Mahadev Konar, Nate Cole, and 
Tom Beerbower.


Bugs: AMBARI-9368
    https://issues.apache.org/jira/browse/AMBARI-9368


Repository: ambari


Description
-------

A deadlock scenario was identified between ServiceComponentImpl and 
ServiceComponentHostImpl. The root cause was the pattern of locking that is 
used in ClusterImpl, ServiceImpl, ServiceComponentImpl and 
ServiceComponentHostImpl. Essentially, our pattern of using 2 locks per 
implementation is, in most cases, unnecessary and deadlock prone. 

ClusterImpl:
I've gone through and removed the internal lock on the cluster. The cluster is 
the highest level of logical organization and should only have 1 lock (hence 
the reason for a cluster global lock). It made no sense to keep another 
internal lock that either locked in parallel or contradicted the global lock. 
Furthermore, methods on primary keys such as clusterId don't even need to lock 
since their values won't change.

ServiceImpl:
Locks around static data or primary key data such as service names, cluster 
IDs, etc, should not have locks around them since they will never change. The 
clusterGlobalLock was removed from many methods where the JPA entity of the 
service was being written to; a cluster lock should not be used unless the 
state of the cluster is changing or being retrieved.

ServiceComponentImpl:
Locks around static data or primary key data such as service names, cluster 
IDs, etc, should not have locks around them since they will never change. In 
many cases, the clusterGlobalLock was removed for the same reasons as stated in 
ServiceImpl.

ServiceComponentHostImpl
Locks around static data or primary key data such as service names, cluster 
IDs, etc, should not have locks around them since they will never change. In 
many cases, the clusterGlobalLock was removed for the same reasons as stated in 
ServiceImpl.

Overall, the global cluster lock was left on methods that alter the topology of 
the cluster (add/delete of a service/component/host) or when refreshing the 
internal entity. The cluster global lock, however, once the cluster topology 
has changed. The internal locks were left in place where there was work being 
done around a non-thread-safe member or when an internal entity was having 
non-primary-key data set.


Diffs
-----

  
ambari-server/src/main/java/org/apache/ambari/server/state/ServiceComponentImpl.java
 a31e42c 
  ambari-server/src/main/java/org/apache/ambari/server/state/ServiceImpl.java 
85e8314 
  
ambari-server/src/main/java/org/apache/ambari/server/state/cluster/ClusterImpl.java
 0bfaa19 
  
ambari-server/src/main/java/org/apache/ambari/server/state/svccomphost/ServiceComponentHostImpl.java
 dbdef2b 
  
ambari-server/src/test/java/org/apache/ambari/server/state/cluster/ClusterDeadlockTest.java
 PRE-CREATION 

Diff: https://reviews.apache.org/r/30357/diff/


Testing
-------

Installed a small (5 node) and large (20 node) cluster and performed an 
end-to-end install with all services. In both cases, I had two browsers opened 
so there were interwoven requests during installations. 

In the small cluster, I forced a yum repository error on install so that I 
would need to retry the failed install. The cluster deleted and the services 
were repopulated on the retry.

After both clusters were installed, I execute an Add Service from the browser 
at the same time as a DELETE service request from a 2nd browser. In both cases, 
the requests were scheduled correctly and the services were propagated.

Testing also included adding new hosts, and adding new host components from 2 
different browers simultaneously. 

New tests were added to cover threads performing read serialization on the 
impl's concurrently with impl creation and writes. These tests deadlocked 
before the changes.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 29:21 min
[INFO] Finished at: 2015-01-28T01:36:06-05:00
[INFO] Final Memory: 36M/443M
[INFO] ------------------------------------------------------------------------


Thanks,

Jonathan Hurley

Reply via email to