[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe updated MAPREDUCE-4099:
----------------------------------

    Attachment: MAPREDUCE-4099-addendum.patch

Great catch, Sid!  Apologies for missing the race condition; I forgot that the 
history server flush is performed by stop().  The 5-second sleep in the AM 
before it calls stop() hid this issue during my manual testing.

The addendum patch moves the staging cleanup into a service that is registered 
after the RM container allocator service but before the job history event 
handler.  This allows the job history to be flushed and moved to the done 
intermediate directory before the staging directory is removed, and the staging 
directory removal will still occur before unregistering with the RM.
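
To illustrate the ordering argument, here is a minimal sketch. It assumes 
services are stopped in the reverse of their registration order, which is what 
the description above implies. The types and names below (SketchService, 
SketchComposite, StopOrderSketch) are hypothetical stand-ins written for this 
example; they are not the classes touched by the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a service with a stop hook (not Hadoop's Service API).
interface SketchService {
    void stop(List<String> log);
}

// Hypothetical composite that stops its children in reverse registration order.
final class SketchComposite {
    private final List<SketchService> services = new ArrayList<>();

    void addService(SketchService s) {
        services.add(s);
    }

    List<String> stopAll() {
        List<String> log = new ArrayList<>();
        // Last registered is stopped first.
        for (int i = services.size() - 1; i >= 0; i--) {
            services.get(i).stop(log);
        }
        return log;
    }
}

final class StopOrderSketch {
    // Builds a service whose stop() only records the action it stands for.
    static SketchService service(String action) {
        return log -> log.add(action);
    }

    static List<String> run() {
        SketchComposite am = new SketchComposite();
        // Registration order as described above: allocator, then staging
        // cleanup, then the job history event handler.
        am.addService(service("unregister from RM"));
        am.addService(service("delete staging directory"));
        am.addService(service("flush job history"));
        return am.stopAll();
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

Under that assumption, stopping the composite records "flush job history", 
then "delete staging directory", then "unregister from RM", so the history is 
flushed before the staging directory goes away and the RM only hears that the 
AM has finished after cleanup is complete.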

The patch also moves the test case to a more appropriate location.
                
> ApplicationMaster may fail to remove staging directory
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-4099
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4099
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.23.3, 2.0.0
>
>         Attachments: MAPREDUCE-4099-addendum.patch, MAPREDUCE-4099.patch, 
> MAPREDUCE-4099.patch, MAPREDUCE-4099.patch
>
>
> When the ApplicationMaster shuts down, it's supposed to remove the staging 
> directory, assuming properties weren't set to override this behavior. During 
> shutdown the AM tells the ResourceManager that it has finished before it 
> cleans up the staging directory.  However, upon hearing the AM has finished, 
> the RM turns right around and kills the AM container.  If the AM is too slow, 
> it will be killed before the staging directory is removed.
> We're seeing the AM lose this race fairly consistently on our clusters, and 
> the lack of staging directory cleanup quickly leads to filesystem quota 
> issues for some users.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
