[jira] [Commented] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

Vinod Kumar Vavilapalli (Commented) (JIRA) Fri, 27 Jan 2012 12:16:36 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13195063#comment-13195063
 ]


Vinod Kumar Vavilapalli commented on MAPREDUCE-3711:
----------------------------------------------------

That is the bug. I can't believe the tests are passing; I guess the tests 
validate the code instead of the requirements *smile*

The bug is in FileOutputCommitter itself. In recoverTask(), pathToRecover 
points to the job directory instead of the task directory.
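
To make the mix-up concrete, here is a minimal sketch of the two path computations. It is not the actual FileOutputCommitter code: the class name, the method names, and the `_temporary/<job attempt>/<task attempt>` layout are assumptions made for illustration, and `java.nio.file.Path` stands in for Hadoop's own Path class.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: names and directory layout are hypothetical,
// not taken from FileOutputCommitter.
class RecoverPathSketch {

    // Assumed layout: <output>/_temporary/<jobAttemptId>/<taskAttemptId>
    static Path jobAttemptDir(Path output, int jobAttemptId) {
        return output.resolve("_temporary").resolve(Integer.toString(jobAttemptId));
    }

    // The kind of mistake described above: the task attempt id is ignored,
    // so the result is the job-level directory of the previous attempt.
    static Path buggyPathToRecover(Path output, int previousJobAttemptId, String taskAttemptId) {
        return jobAttemptDir(output, previousJobAttemptId);
    }

    // What recovery needs: the directory of this one task attempt.
    static Path fixedPathToRecover(Path output, int previousJobAttemptId, String taskAttemptId) {
        return jobAttemptDir(output, previousJobAttemptId).resolve(taskAttemptId);
    }

    public static void main(String[] args) {
        Path output = Paths.get("out");
        String taskAttemptId = "attempt_x_r_000000_0"; // made-up id
        System.out.println("buggy: " + buggyPathToRecover(output, 1, taskAttemptId));
        System.out.println("fixed: " + fixedPathToRecover(output, 1, taskAttemptId));
    }
}
```

With the job-level path, every recovered task operates on the whole job directory rather than on its own output, which would fit the slow recovery reported here.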

bq. Another thing that I am a bit confused about is why we are even trying to 
recover the mapper output. It is not a map-only job. It has reducers too.

The code is supposed to check for the existence of the task directory on DFS 
and move it only if it exists. For map+reduce jobs, the recovery won't find 
any directories for the map tasks. Granted, this can be optimized by not 
making any trips to DFS, but that seems like the lesser issue to me. Anyway, 
please do investigate how hard it would be for RecoverService to not invoke 
recoverTask() for maps in map+reduce jobs.
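
A rough sketch of the intended behaviour, plus the suggested optimization, might look like the following. It uses the local filesystem through `java.nio.file` as a stand-in for DFS, and every name in it (`RecoverTaskSketch`, `needsOutputRecovery`, the `generation-N` directories, the task ids) is hypothetical rather than taken from the Hadoop code base.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: local files stand in for DFS, all names are made up.
class RecoverTaskSketch {

    // Optimization: map output only lives in the committer's directories
    // for map-only jobs, so maps of a map+reduce job can be skipped
    // without any filesystem round trip.
    static boolean needsOutputRecovery(boolean isMapTask, int numReduces) {
        return !isMapTask || numReduces == 0;
    }

    // Move the previous attempt's task directory only if it exists.
    // Returns true when something was actually moved.
    static boolean recoverTask(Path previousTaskDir, Path currentTaskDir) {
        if (!Files.isDirectory(previousTaskDir)) {
            return false; // nothing was written for this task, nothing to recover
        }
        try {
            Path parent = currentTaskDir.getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.move(previousTaskDir, currentTaskDir);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: a fresh scratch directory.
    static Path newScratchRoot() {
        try {
            return Files.createTempDirectory("recover-sketch");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: a task directory holding one output file.
    static Path newTaskDirWithOutput(Path generationDir, String taskId) {
        try {
            Path dir = Files.createDirectories(generationDir.resolve(taskId));
            Files.createFile(dir.resolve("part-00000"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = newScratchRoot();
        Path previous = root.resolve("generation-1");
        Path current = root.resolve("generation-2");
        newTaskDirWithOutput(previous, "task_r_000000");

        int numReduces = 680;
        String[] taskIds = {"task_m_000000", "task_r_000000"};
        for (String taskId : taskIds) {
            boolean isMap = taskId.contains("_m_");
            if (!needsOutputRecovery(isMap, numReduces)) {
                System.out.println(taskId + ": skipped, no filesystem access");
                continue;
            }
            boolean moved = recoverTask(previous.resolve(taskId), current.resolve(taskId));
            System.out.println(taskId + ": moved=" + moved);
        }
    }
}
```

The existence check alone keeps recovery correct for map tasks; the `needsOutputRecovery` guard only saves the per-map round trip to the filesystem, which is why it is the lesser issue.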

Can you provide a patch? Also, while you are at it, can you rename things in 
FileOutputCommitter? The code was very hard to read because of all the 
confusing attempt nomenclature. Maybe we should call job attempts "job 
generations" to make things saner.
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with a sort job on 350 scale having 16800 maps and 680 reduces:
> 1. After 70 secs of job submission, the AM was killed using kill -9. Around 
> 3900 maps were completed and 680 reduces were scheduled. A second AM was 
> started. The job completed in 980 secs. The AM took very little time to 
> recover.
> 2. After 150 secs of job submission, the AM was killed using kill -9. Around 
> 90% of maps were completed and 680 reduces were scheduled. A second AM was 
> started. The job completed in 1000 secs. The AM recovered.
> 3. After 150 secs of job submission, the AM was killed using kill -9. Almost 
> all maps were completed and only the 680 reduces were running. Recovery was 
> too slow: the AM was still recovering after 1 hr 40 mins when I killed the 
> run.

