The use case is:
- creating a START ALL (or STOP ALL) services request
- aborting it immediately
This is happening in Ambari:

- On THREAD 1, org.apache.ambari.server.actionmanager.ActionScheduler.doWork() processes the cancel event and aborts all the HRCs (HostRoleCommands) belonging to the request in question.
- I asked Andrew O. if there was any chance that we return a 'failed' event upon aborting an HRC on the agent side; he said it has worked like this for a long time, so something on the server side must have changed.

Within org.apache.ambari.server.agent.HeartbeatProcessor.processCommandReports(List<CommandReport>, String, Long) we interact with HRCs in the following cases on THREAD 2 (while processing command reports received from agents):

1. First we fetch the tasks map using org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(Collection<Long>). This code uses a Guava Cache, and this is the only place where we populate it (iterating over the HRCs fetched by hostRoleCommandDAO.findByPKs(List<Long>): if a given HRC is not in the cache, we add it).
2. Second, on a FAILED report status, we check whether the corresponding HRC is in progress using org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTask(long, boolean).
3. Finally we update the state machines from the reports using actionManager.processTaskResponse(hostName, reports, commands) (see the very end of the method).

During my tests I found that any of these operations on THREAD 2 could be executed in the middle of THREAD 1's work, so I introduced locking: while we are aborting HRCs on THREAD 1, we must not allow reading/updating HRC states on THREAD 2. I expected this to solve the issue, since the persistence context should then be in good shape (all ABORTED HRCs are merged; I also changed the abort path to merge the HRCs one by one before we invoke the audit log, since it was not correct to log "ABORTED" when we had merely iterated over the entities and populated their fields while the actual merging happened later — a sketch of this ordering follows at the end of this message). Unfortunately it was not enough; we still fetched `old` data from the persistence context, which is why I added the refresh calls.

**But...** While writing this explanation I noticed the following:

- org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.abortOperation(long) did not populate the Guava Cache, which — IMO — it should.
- I missed a Java lock in org.apache.ambari.server.actionmanager.ActionDBAccessorImpl.getTasks(Collection<Long>); I should have added it, since that method fetches HRCs from the persistence context.

After implementing these two items and removing the 'refresh' calls, everything worked (a combined sketch follows at the end of this message). Please review my latest patch.

[ Full content available at: https://github.com/apache/ambari/pull/2411 ]
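To illustrate the merge-before-audit-log ordering mentioned above: this is only a minimal sketch, not Ambari's actual code. `HostRoleCommandEntity`, `Status`, `AuditLogger`, and `logAborted` here are simplified hypothetical stand-ins; only `EntityManager.merge` is the real JPA call. The point is that each command is persisted before the audit log records it as ABORTED.

```java
import java.util.List;
import javax.persistence.EntityManager;

// Hypothetical, simplified stand-ins for the real Ambari types.
class AbortAuditExample {
    enum Status { ABORTED }

    static class HostRoleCommandEntity {
        Status status;
        void setStatus(Status s) { this.status = s; }
    }

    interface AuditLogger {
        void logAborted(HostRoleCommandEntity entity);  // hypothetical audit hook
    }

    // Merge each command into the persistence context before audit logging it,
    // so the log never reports "ABORTED" for a row that is not yet persisted.
    static void abortAndAudit(List<HostRoleCommandEntity> toAbort,
                              EntityManager em, AuditLogger audit) {
        for (HostRoleCommandEntity entity : toAbort) {
            entity.setStatus(Status.ABORTED);
            HostRoleCommandEntity managed = em.merge(entity); // persist first...
            audit.logAborted(managed);                        // ...then audit
        }
    }
}
```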
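And a minimal sketch of the combined fix described in the two bullet points above — the cache-on-miss read in getTasks guarded by the same lock the abort path takes, and the abort path keeping the Guava Cache coherent. The `HostRoleCommand` and `HostRoleCommandDAO` shapes here are simplified assumptions, not the real Ambari classes; only the Guava `Cache`/`CacheBuilder` API is real.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified stand-ins for the real Ambari types.
class HostRoleCommand {
    final long taskId;
    volatile String status;
    HostRoleCommand(long taskId, String status) { this.taskId = taskId; this.status = status; }
}

interface HostRoleCommandDAO {
    List<HostRoleCommand> findByPKs(Collection<Long> taskIds);
}

class TaskStore {
    private final Cache<Long, HostRoleCommand> hrcCache =
            CacheBuilder.newBuilder().expireAfterAccess(5, TimeUnit.MINUTES).build();
    // One lock shared by the abort path (THREAD 1) and report processing (THREAD 2),
    // so reads never interleave with an in-progress abort.
    private final ReentrantLock taskLock = new ReentrantLock();
    private final HostRoleCommandDAO dao;

    TaskStore(HostRoleCommandDAO dao) { this.dao = dao; }

    /** Cache-on-miss read, now guarded by the same lock the abort path takes. */
    List<HostRoleCommand> getTasks(Collection<Long> taskIds) {
        taskLock.lock();
        try {
            List<Long> missing = new ArrayList<>();
            List<HostRoleCommand> result = new ArrayList<>();
            for (Long id : taskIds) {
                HostRoleCommand cached = hrcCache.getIfPresent(id);
                if (cached != null) {
                    result.add(cached);
                } else {
                    missing.add(id);
                }
            }
            for (HostRoleCommand fetched : dao.findByPKs(missing)) {
                hrcCache.put(fetched.taskId, fetched);  // populate on miss
                result.add(fetched);
            }
            return result;
        } finally {
            taskLock.unlock();
        }
    }

    /** Abort path: flip the state AND refresh the cache so later reads see ABORTED. */
    void abortOperation(Collection<HostRoleCommand> commands) {
        taskLock.lock();
        try {
            for (HostRoleCommand hrc : commands) {
                hrc.status = "ABORTED";
                // the per-command merge into the DB would happen here, before audit logging
                hrcCache.put(hrc.taskId, hrc);  // keep the cache coherent with the DB
            }
        } finally {
            taskLock.unlock();
        }
    }
}
```

With both threads serialized on `taskLock` and the cache updated in the abort path itself, the stale reads that the 'refresh' calls papered over should no longer occur.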
