[ https://issues.apache.org/jira/browse/OOZIE-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925120#comment-13925120 ]
Rohini Palaniswamy commented on OOZIE-1533:
-------------------------------------------

[~sriksun],
   We wanted to do that as well for performance reasons, and Virag had an investigation task for it. But due to other high-priority tasks we did not spend much time on it, and it has been in the backlog since. Based on what I remember, a few comments:
   - One problem that needs to be addressed before this is that there are a lot of places in the code where the coord job is updated. We often hit problems with StatusTransitService updating the coord job without a lock. A race with another command updating the status at the same time leaves the coordinator in a wrong or invalid state, and we then have to run database queries to restore it to the proper state.
   - Doing this for CoordActionInputCheckXCommand might not be a good idea, as it will put a lot of pressure on the namenode and also on the Oozie database. Just 2 days back we had one user set a high throttle factor of 24 for a bundle with 70 coordinators, and it brought Oozie down. The first coordinator was marked FAILED because it hit a db read timeout during materialization. All the other coordinators ended up with a lot of WAITING actions. Just the volume of "update coord_actions set last_modified_time where id=x" statements put such a high load on the db that many more workflows and coord actions had errors due to db exceptions.
   - Another thing is that interrupt commands like coord kill, etc. will not be processed ahead of other commands if the lock is changed to the action id.

[~virag],
   I am sure you will have more info to add on other cases to address. Can you comment on this when you get time?
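
To make the lock-key trade-off in the comment above concrete, here is a minimal Java sketch. It is not Oozie's lock service: the class LockGranularitySketch, its methods, and the ids are hypothetical, and an in-process ReentrantLock stands in for whatever lock implementation Oozie actually uses. It only shows that keying on the job id makes every action of a coordinator (and a job-level command such as kill) contend on one lock, while keying on the action id gives each action its own lock that a job-level command does not share.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration of lock granularity; not Oozie code. */
public class LockGranularitySketch {
    // One lock object per key, created lazily on first use.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    /** Job-level locking: every action of a job maps to the job id. */
    public static String jobLevelKey(String jobId, String actionId) {
        return jobId;
    }

    /** Action-level locking: every action maps to its own id. */
    public static String actionLevelKey(String jobId, String actionId) {
        return actionId;
    }

    public static void main(String[] args) {
        LockGranularitySketch registry = new LockGranularitySketch();
        String job = "0000001-C";      // illustrative ids only
        String action1 = job + "@1";
        String action2 = job + "@2";

        // Job-level keys: both actions resolve to the same lock, so they serialize.
        boolean jobLevelShared =
            registry.lockFor(jobLevelKey(job, action1)) == registry.lockFor(jobLevelKey(job, action2));

        // Action-level keys: distinct locks, so the two actions can proceed in parallel.
        boolean actionLevelShared =
            registry.lockFor(actionLevelKey(job, action1)) == registry.lockFor(actionLevelKey(job, action2));

        // A job-level command (e.g. kill) keyed on the job id does not share
        // a lock with a command keyed on an action id.
        boolean killSharesActionLock =
            registry.lockFor(job) == registry.lockFor(actionLevelKey(job, action1));

        System.out.println("job-level keys share a lock: " + jobLevelShared);
        System.out.println("action-level keys share a lock: " + actionLevelShared);
        System.out.println("job kill shares the action lock: " + killSharesActionLock);
    }
}
```

With job-level keys the first check is true; with action-level keys the second and third are false. That is both the parallelism gain and the reason a job-level interrupt would no longer be ordered against per-action commands through a common lock.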
> Coordinator action materialization is too slow due to coarse job level locks
> ----------------------------------------------------------------------------
>
>                 Key: OOZIE-1533
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1533
>             Project: Oozie
>          Issue Type: Improvement
>            Reporter: Srikanth Sundarrajan
>            Assignee: Srikanth Sundarrajan
>              Labels: locking
>         Attachments: OOZIE-1533.patch
>
>
> Coord job level lock introduces high contention. Instead introduce coord action level locking whenever appropriate