[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

Robert Joseph Evans (Commented) (JIRA) Mon, 30 Jan 2012 14:43:34 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196502#comment-13196502
 ]


Robert Joseph Evans commented on MAPREDUCE-3711:
------------------------------------------------

OK, I have mapped out the FileOutputCommitter directory structure and read 
through as much of the code as I can.

{noformat}
<outputPath>/_temporary/<appAttemptID (int)>/_temporary/_<taskAttemptID (string)>/
            |-----JobAttemptBaseDirName-----|
            |------------------tmpDir------------------|
            |-----------------------taskAttemptBaseDirName-----------------------|
|-------------------------------------workDir------------------------------------|
{noformat}
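
To make the nesting in the diagram concrete, here is a minimal sketch that builds the same paths with plain java.nio. The class and method names are made up to mirror the diagram labels (they are not the actual FileOutputCommitter API), each method returns the full path rather than the relative name, and the output path and task attempt ID in main() are placeholders.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper only: names mirror the labels in the diagram and are
// not the real FileOutputCommitter fields or methods.
final class CommitterLayoutSketch {
    static final String TEMP_DIR_NAME = "_temporary";

    // <outputPath>/_temporary/<appAttemptID>
    static Path jobAttemptBaseDir(Path outputPath, int appAttemptId) {
        return outputPath.resolve(TEMP_DIR_NAME).resolve(Integer.toString(appAttemptId));
    }

    // <outputPath>/_temporary/<appAttemptID>/_temporary
    static Path tmpDir(Path outputPath, int appAttemptId) {
        return jobAttemptBaseDir(outputPath, appAttemptId).resolve(TEMP_DIR_NAME);
    }

    // <outputPath>/_temporary/<appAttemptID>/_temporary/_<taskAttemptID>
    static Path workDir(Path outputPath, int appAttemptId, String taskAttemptId) {
        return tmpDir(outputPath, appAttemptId).resolve("_" + taskAttemptId);
    }

    public static void main(String[] args) {
        Path out = Paths.get("out");   // placeholder output path
        String task = "task0";         // placeholder task attempt ID
        System.out.println(jobAttemptBaseDir(out, 1));
        System.out.println(tmpDir(out, 1));
        System.out.println(workDir(out, 1, task));
    }
}
```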

setupJob() creates the tmpDir directory.
commitJob() deletes the tmpDir directory, moves everything else in 
JobAttemptBaseDirName to outputPath and then deletes _temporary under 
outputPath.
cleanupJob() and abortJob() just delete _temporary under outputPath.
setupTask() is a noop.
commitTask() moves everything under workDir to JobAttemptBaseDirName.
abortTask() deletes workDir.
recoverTask() moves everything under the previous attempt's 
JobAttemptBaseDirName (appAttemptID - 1) to the current attempt's 
JobAttemptBaseDirName.

The problem is that we cannot recover just a single task with the current 
directory structure.  The FileOutputFormat API allows a user to put anything 
they want into workDir.  The onus is on the user to be sure that the output of 
one task will not collide with the output from another task.  We provide some 
APIs to make this simple, and if it is just normal mapper or reducer output 
then we handle that internally.  But if the outputs do collide, we happily 
delete the first output file and move in the new one to replace it.  This makes 
it impossible to recover the first completed task without recovering the second 
one too.
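
The collision can be reproduced outside Hadoop with a small sketch. Everything here is a stand-in: commitTask() below is a simplified local-filesystem imitation of the behavior described above, not the real committer code, and the task names and the "side-file" name are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class CommitCollisionSketch {
    // Simplified stand-in for commitTask(): move every file in workDir up into
    // the job attempt directory, replacing any file that is already there.
    static void commitTask(Path workDir, Path jobAttemptDir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(workDir)) {
            files = listing.collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, jobAttemptDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Two tasks write a file with the same name; returns the content that survives.
    static String survivingContent() {
        try {
            Path root = Files.createTempDirectory("committer-sketch");
            Path jobAttemptDir = root.resolve("_temporary").resolve("1");
            Path workA = Files.createDirectories(
                    jobAttemptDir.resolve("_temporary").resolve("_taskA"));
            Path workB = Files.createDirectories(
                    jobAttemptDir.resolve("_temporary").resolve("_taskB"));
            Files.write(workA.resolve("side-file"),
                    "from task A".getBytes(StandardCharsets.UTF_8));
            Files.write(workB.resolve("side-file"),
                    "from task B".getBytes(StandardCharsets.UTF_8));

            commitTask(workA, jobAttemptDir);
            commitTask(workB, jobAttemptDir); // silently replaces task A's file

            return new String(Files.readAllBytes(jobAttemptDir.resolve("side-file")),
                    StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(survivingContent()); // prints "from task B"
    }
}
```

Once both commits have run, nothing in the job attempt directory records that task A ever produced a file, which is why task A cannot be recovered on its own.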

There are two possible ways to fix recoverTask that I see.

The first one is to add a recoverJob API in addition to recoverTask.  In the 
case of FileOutputFormat, recoverJob would be implemented to do what 
recoverTask does now, except it would also delete the _temporary directory 
under the JobAttemptBaseDirName.  recoverTask would then become a noop for 
FileOutputFormat.
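
A rough local-filesystem sketch of what such a recoverJob could look like follows. It is not a patch: the method name and signature are hypothetical, and it assumes that the _temporary directory to delete is the one left behind by the previous attempt (uncommitted task work), which is one reading of the proposal.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RecoverJobSketch {
    static final String TEMP_DIR_NAME = "_temporary";

    // Delete a directory tree, children before parents.
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Hypothetical recoverJob(): carry everything the previous attempt
    // committed into the current attempt's directory in a single pass.
    static void recoverJob(Path outputPath, int currentAttemptId) throws IOException {
        Path tempRoot = outputPath.resolve(TEMP_DIR_NAME);
        Path previous = tempRoot.resolve(Integer.toString(currentAttemptId - 1));
        Path current = tempRoot.resolve(Integer.toString(currentAttemptId));
        Files.createDirectories(current);
        if (!Files.isDirectory(previous)) {
            return; // nothing to recover
        }
        // Assumption: uncommitted task work from the old attempt is discarded.
        deleteRecursively(previous.resolve(TEMP_DIR_NAME));
        List<Path> committed;
        try (Stream<Path> listing = Files.list(previous)) {
            committed = listing.collect(Collectors.toList());
        }
        for (Path entry : committed) {
            Files.move(entry, current.resolve(entry.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }

    // Builds a fake previous attempt, recovers it, and lists the new attempt dir.
    static String demo() {
        try {
            Path out = Files.createTempDirectory("recover-job-sketch");
            Path previous = out.resolve(TEMP_DIR_NAME).resolve("1");
            Path oldWork = Files.createDirectories(
                    previous.resolve(TEMP_DIR_NAME).resolve("_taskX"));
            Files.write(previous.resolve("part-00000"),
                    "committed".getBytes(StandardCharsets.UTF_8));
            Files.write(oldWork.resolve("partial"),
                    "uncommitted".getBytes(StandardCharsets.UTF_8));

            recoverJob(out, 2);

            try (Stream<Path> listing = Files.list(out.resolve(TEMP_DIR_NAME).resolve("2"))) {
                return listing.map(p -> p.getFileName().toString())
                        .sorted().collect(Collectors.joining(","));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "part-00000"
    }
}
```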

The second option is to completely rewrite the way that FileOutputFormat stores 
intermediate results.  We would keep the output from each task separate until 
the Job is committed.  That way we could recover each task one at a time.
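
To illustrate why keeping task output separate makes per-task recovery possible, here is a small sketch of one possible layout. The layout itself (a directory per committed task under the job attempt directory) is purely an assumption for illustration, as are the method signatures and the task and file names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class PerTaskLayoutSketch {
    // Assumed layout: commitTask keeps the task's files together by renaming
    // the whole work directory to <jobAttemptDir>/<taskId>/.
    static void commitTask(Path workDir, Path jobAttemptDir, String taskId)
            throws IOException {
        Files.move(workDir, jobAttemptDir.resolve(taskId));
    }

    // Recover exactly one task by renaming its committed directory from the
    // previous attempt into the current attempt.
    static void recoverTask(Path previousAttemptDir, Path currentAttemptDir,
                            String taskId) throws IOException {
        Files.createDirectories(currentAttemptDir);
        Files.move(previousAttemptDir.resolve(taskId), currentAttemptDir.resolve(taskId));
    }

    // Two tasks write a same-named file; only task A is recovered.
    static String demo() {
        try {
            Path root = Files.createTempDirectory("per-task-sketch");
            Path attempt1 = root.resolve("_temporary").resolve("1");
            Path attempt2 = root.resolve("_temporary").resolve("2");
            Path workA = Files.createDirectories(
                    attempt1.resolve("_temporary").resolve("_taskA"));
            Path workB = Files.createDirectories(
                    attempt1.resolve("_temporary").resolve("_taskB"));
            Files.write(workA.resolve("side-file"),
                    "from task A".getBytes(StandardCharsets.UTF_8));
            Files.write(workB.resolve("side-file"),
                    "from task B".getBytes(StandardCharsets.UTF_8));

            commitTask(workA, attempt1, "taskA");
            commitTask(workB, attempt1, "taskB"); // no collision: separate directories

            recoverTask(attempt1, attempt2, "taskA");

            String recovered = new String(
                    Files.readAllBytes(attempt2.resolve("taskA").resolve("side-file")),
                    StandardCharsets.UTF_8);
            boolean taskBRecovered = Files.exists(attempt2.resolve("taskB"));
            return recovered + "|" + taskBRecovered;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "from task A|false"
    }
}
```

In a layout like this, any name collisions would only have to be resolved when the job itself is committed.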

I am fine with either way.  As Vinod has also asked me to clean up the code, 
redoing the directory layout too is not that big of a deal.  However, I am 
leaning towards adding recoverJob, as it seems like a good API to have in 
OutputFormat to begin with, and it is the smallest change to make this work.

If someone feels otherwise please post a comment here.  In the meantime I will 
try to get a patch up that adds in recoverJob.
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with sort job on 350 scale having 16800 maps and 680 reduces -:
> 1. After 70 secs of job submission the AM was killed using kill -9; around 
> 3900 maps were completed and 680 reduces were scheduled. A second AM was 
> started and the job completed in 980 secs. The AM took very little time to 
> recover.
> 2. After 150 secs of job submission the AM was killed using kill -9; around 
> 90% of maps were completed and 680 reduces were scheduled. A second AM was 
> started and the job completed in 1000 secs. The AM recovered.
> 3. After 150 secs of job submission the AM was killed using kill -9; almost 
> all maps were completed and only 680 reduces were running. Recovery was too 
> slow: the AM was still recovering after 1 hr 40 mins when I killed the run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
