[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4813:
----------------------------------

    Attachment: MAPREDUCE-4813-2.patch
                JobImplStateMachine.pdf

Per Vinod's suggestions, I updated the patch to move all of the committer 
interactions in JobImpl to a separate CommitterEventHandler, which was 
previously known as TaskCleaner.

The asynchronous processing of committer callbacks required adding new internal 
states to JobImpl, specifically:

* SETUP which occurs after INITED while processing the setupJob callback
* COMMITTING which occurs after RUNNING while processing the commitJob callback
* FAIL_ABORT which occurs prior to FAILED while processing the abortJob 
callback
* KILL_ABORT which occurs prior to KILLED while processing the abortJob callback
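
As a rough illustration of where the new internal states sit, here is a minimal 
sketch of the orderings listed above. It is not the code from the patch: 
SketchJobState and SketchJobTransitions are hypothetical names, the SETUP to 
RUNNING and COMMITTING to SUCCEEDED edges are assumptions on my part, and the 
transitions that lead into FAIL_ABORT and KILL_ABORT are left out because they 
are not spelled out here.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; not the actual JobImpl state machine.
enum SketchJobState {
    INITED, SETUP, RUNNING, COMMITTING, SUCCEEDED,
    FAIL_ABORT, FAILED, KILL_ABORT, KILLED
}

final class SketchJobTransitions {
    private static final Map<SketchJobState, Set<SketchJobState>> NEXT =
        new EnumMap<>(SketchJobState.class);

    static {
        // SETUP occurs after INITED (setupJob callback in flight).
        NEXT.put(SketchJobState.INITED, EnumSet.of(SketchJobState.SETUP));
        // Assumption: a successful setupJob moves the job on to RUNNING.
        NEXT.put(SketchJobState.SETUP, EnumSet.of(SketchJobState.RUNNING));
        // COMMITTING occurs after RUNNING (commitJob callback in flight).
        NEXT.put(SketchJobState.RUNNING, EnumSet.of(SketchJobState.COMMITTING));
        // Assumption: a successful commitJob ends in SUCCEEDED.
        NEXT.put(SketchJobState.COMMITTING, EnumSet.of(SketchJobState.SUCCEEDED));
        // FAIL_ABORT occurs prior to FAILED (abortJob callback in flight).
        NEXT.put(SketchJobState.FAIL_ABORT, EnumSet.of(SketchJobState.FAILED));
        // KILL_ABORT occurs prior to KILLED (abortJob callback in flight).
        NEXT.put(SketchJobState.KILL_ABORT, EnumSet.of(SketchJobState.KILLED));
    }

    // True if the sketch models a direct transition from 'from' to 'to'.
    static boolean isAllowed(SketchJobState from, SketchJobState to) {
        Set<SketchJobState> targets = NEXT.get(from);
        return targets != null && targets.contains(to);
    }
}
```

In this model a job can no longer go straight from RUNNING to a terminal 
state; it has to pass through the state that waits on the committer callback.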

One significant shift with this rework is that the committer's setupJob call is 
now performed *after* INITED and after the job reports externally that it is 
RUNNING.  Previously the setupJob callback was processed synchronously within 
the MRAppMaster.start method.  Deferring it seemed like the cleanest way to 
handle the now asynchronous nature of the committer callback.
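
To make the asynchronous handoff concrete, here is a minimal sketch of the 
general idea, not the patch itself. SketchCommitter, SketchCommitterEventHandler 
and SketchJob are hypothetical stand-ins, states are plain strings, and a 
single-threaded executor is assumed for the committer work. The point is only 
that the job's write lock is held while recording the state change and queuing 
the commit, not for the duration of commitJob.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

// Hypothetical stand-in for the output committer; only commitJob is modeled.
interface SketchCommitter {
    void commitJob() throws Exception;
}

// Hypothetical handler that runs committer calls off the caller's thread.
final class SketchCommitterEventHandler {
    // Daemon thread so a stuck commit cannot keep the JVM alive in this sketch.
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "sketch-committer");
        t.setDaemon(true);
        return t;
    });

    // Queue the commit and report the outcome through the callback.
    void handleCommit(SketchCommitter committer, Consumer<Boolean> onDone) {
        pool.execute(() -> {
            boolean ok;
            try {
                committer.commitJob();
                ok = true;
            } catch (Exception e) {
                ok = false;
            }
            onDone.accept(ok);
        });
    }

    // Stop accepting work and wait for queued commits to finish.
    boolean shutdownAndWait(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

// Hypothetical job; the lock plays the role of the JobImpl read/write lock.
final class SketchJob {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final SketchCommitterEventHandler handler;
    private String state = "RUNNING";

    SketchJob(SketchCommitterEventHandler handler) {
        this.handler = handler;
    }

    // Holds the write lock only long enough to change state and queue the work.
    void startCommit(SketchCommitter committer) {
        lock.writeLock().lock();
        try {
            state = "COMMITTING";
            handler.handleCommit(committer, this::commitDone);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Runs on the handler thread once commitJob has returned or thrown.
    private void commitDone(boolean ok) {
        lock.writeLock().lock();
        try {
            // On failure the job would next wait on abortJob (FAIL_ABORT).
            state = ok ? "SUCCEEDED" : "FAIL_ABORT";
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers (e.g. a heartbeat thread) are not blocked by a slow commit.
    String getState() {
        lock.readLock().lock();
        try {
            return state;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this shape, a thread calling getState() while commitJob is still running 
sees COMMITTING right away instead of blocking behind the commit.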

                
> AM timing out during job commit
> -------------------------------
>
>                 Key: MAPREDUCE-4813
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4813
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: JobImplStateMachine.pdf, MAPREDUCE-4813-2.patch, 
> MAPREDUCE-4813.patch, MAPREDUCE-4813.patch, MAPREDUCE-4813.patch
>
>
> The AM calls the output committer's {{commitJob}} method synchronously during 
> JobImpl state transitions, which means the JobImpl write lock is held the 
> entire time the job is being committed.  Holding the write lock prevents the 
> RM allocator thread from heartbeating to the RM.  Therefore if committing the 
> job takes too long (e.g., the job has tons of files to commit and/or the 
> namenode is bogged down) then the AM appears to be unresponsive to the RM and 
> the RM kills the AM attempt.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
