The use case is:
- creating a START ALL (or STOP ALL) services request
- aborting it immediately

This is happening in Ambari:
- on THREAD 1, org.apache.ambari.server.actionmanager.ActionScheduler.doWork() 
processes the cancel event and aborts all the HRCs belonging to the request in 
question
- I asked Andrew O. whether there was any chance that we return a 'failed' 
event upon aborting an HRC on the agent side; he said it has worked like this 
for a long time, so something must have changed on the server side. Within 
org.apache.ambari.server.agent.HeartbeatProcessor.processCommandReports(List<CommandReport>,
 String, Long) we interact with HRCs in the following cases on THREAD 2 (while 
processing command reports received from agents):

1. First we fetch the tasks map using 
org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(Collection<Long>).
 This code uses a Guava Cache, and this is the only place where we populate it: 
iterating over the HRCs fetched by hostRoleCommandDAO.findByPKs(List<Long>), we 
add each HRC to the cache if it is not already there.

2. Second, on a FAILED report status, we check whether the corresponding HRC is 
still in progress using 
org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTask(long, 
boolean).

3. Finally we update the state machines from the reports via 
actionManager.processTaskResponse(hostName, reports, commands) (see the very 
end of the method).
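The get-or-populate pattern from step 1 can be sketched roughly as follows. This is a hypothetical stand-in, not Ambari's actual code: a ConcurrentHashMap replaces the Guava Cache, the DAO fetch is faked, and all class and method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TaskCacheSketch {
    // Minimal stand-in for HostRoleCommand; only the fields needed here.
    static class HostRoleCommand {
        final long taskId;
        final String status;
        HostRoleCommand(long taskId, String status) {
            this.taskId = taskId;
            this.status = status;
        }
    }

    // Stand-in for the Guava Cache keyed by task id.
    private final Map<Long, HostRoleCommand> cache = new ConcurrentHashMap<>();

    // Stand-in for hostRoleCommandDAO.findByPKs(List<Long>): pretends every
    // requested HRC exists in the database with status QUEUED.
    private List<HostRoleCommand> findByPKs(Collection<Long> taskIds) {
        List<HostRoleCommand> result = new ArrayList<>();
        for (Long id : taskIds) {
            result.add(new HostRoleCommand(id, "QUEUED"));
        }
        return result;
    }

    // Mirrors the shape of getTasks(Collection<Long>): fetch the HRCs from
    // the persistence layer, then add each one to the cache only if it is
    // not already cached.
    public List<HostRoleCommand> getTasks(Collection<Long> taskIds) {
        List<HostRoleCommand> tasks = findByPKs(taskIds);
        for (HostRoleCommand hrc : tasks) {
            cache.putIfAbsent(hrc.taskId, hrc);
        }
        return tasks;
    }

    public int cacheSize() {
        return cache.size();
    }
}
```

Because only getTasks() ever writes the cache, anything that changes HRC state elsewhere (such as aborting) can leave stale cached entries behind; that is the gap discussed further below.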


During my tests I found that any of these operations on THREAD 2 could be 
executed in the middle of THREAD 1's work, so I introduced locking: while we 
are aborting HRCs on THREAD 1, we must not allow reading or updating HRC states 
on THREAD 2. I expected this to solve the issue, since the persistence context 
should then be in good shape (all ABORTED HRCs are merged; I also modified the 
code to merge them one by one before we invoke the audit log, since it was not 
correct to log "ABORTED" when we had merely iterated over the HRCs and 
populated their fields while the actual merge happened later).
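The locking idea above can be sketched as two threads contending on one lock. This is an illustrative sketch, not the actual patch: the class, method names, and the in-memory status map are all hypothetical, and a ReentrantLock stands in for whatever synchronization the real code uses.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

public class AbortLockSketch {
    private final ReentrantLock hrcLock = new ReentrantLock();
    private final Map<Long, String> hrcStatus = new HashMap<>();

    // THREAD 1: abort every HRC of the request under the lock, so the whole
    // abort is atomic with respect to report processing.
    public void abortOperation(Collection<Long> taskIds) {
        hrcLock.lock();
        try {
            for (Long id : taskIds) {
                hrcStatus.put(id, "ABORTED"); // merged one by one
            }
        } finally {
            hrcLock.unlock();
        }
    }

    // THREAD 2: take the same lock before reading/updating HRC state from a
    // command report, so a half-finished abort is never observed here.
    public void processCommandReport(long taskId, String reportedStatus) {
        hrcLock.lock();
        try {
            // An ABORTED task must not be flipped back by a late agent report.
            if (!"ABORTED".equals(hrcStatus.get(taskId))) {
                hrcStatus.put(taskId, reportedStatus);
            }
        } finally {
            hrcLock.unlock();
        }
    }

    public String status(long taskId) {
        return hrcStatus.get(taskId);
    }
}
```

The point of the sketch is the ordering guarantee: once abortOperation() has run, a late FAILED report from an agent can no longer overwrite the ABORTED state.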

While writing this explanation I noticed the following:
- 
org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.abortOperation(long)
 did not populate the Guava Cache, which, in my opinion, it should
- I had missed adding a Java lock in 
org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(Collection<Long>),
 even though I should have, since it fetches HRCs from the persistent 
context
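The two fixes might be sketched as below. Again this is a hypothetical simplification, not the patch itself: a map stands in for both the cache and the persistence layer, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class AbortFixSketch {
    private final ReentrantLock hrcLock = new ReentrantLock();
    private final Map<Long, String> cache = new ConcurrentHashMap<>();

    public void abortOperation(Collection<Long> taskIds) {
        hrcLock.lock();
        try {
            for (Long id : taskIds) {
                // Fix 1: write the aborted state into the cache as well, so
                // later getTask()/getTasks() calls cannot observe a stale
                // status and no separate 'refresh' call is needed.
                cache.put(id, "ABORTED");
            }
        } finally {
            hrcLock.unlock();
        }
    }

    public List<String> getTasks(Collection<Long> taskIds) {
        // Fix 2: take the same lock here too, since this method reads HRCs
        // from the persistence context and populates the cache.
        hrcLock.lock();
        try {
            List<String> statuses = new ArrayList<>();
            for (Long id : taskIds) {
                // Stand-in for hostRoleCommandDAO.findByPKs: an unknown
                // task defaults to QUEUED.
                statuses.add(cache.computeIfAbsent(id, k -> "QUEUED"));
            }
            return statuses;
        } finally {
            hrcLock.unlock();
        }
    }
}
```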

After implementing these two items and removing the 'refresh' calls, everything 
worked.

Please review my latest patch.

[ Full content available at: https://github.com/apache/ambari/pull/2411 ]
This message was relayed via gitbox.apache.org for [email protected]
