[jira] [Updated] (MAPREDUCE-3711) AppMaster recovery for Medium to large jobs take long time

Vinod Kumar Vavilapalli (Updated) (JIRA) Wed, 01 Feb 2012 21:38:26 -0800

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Vinod Kumar Vavilapalli updated MAPREDUCE-3711:
-----------------------------------------------

       Fix Version/s: 0.23.1
    Target Version/s: 0.23.1, 0.24.0  (was: 0.24.0, 0.23.1)
              Status: Open  (was: Patch Available)

Did a full review; this is quite an involved change. The main 
FileOutputCommitter (FOC) changes are fine. Some comments:
 - The condition in RecoveryService is wrong. It should be (!(iAmAMap && 
numReduces == 0)). Or simpler, (iAMReduce || numReduces > 0). Please see if you 
can add a test case validating the different cases possible here (only maps, 
maps + reduces, crossed with recovery for maps and recovery for reduces).
 - This bug also happens with mapred.FOC, which we need to fix. While you are 
at it, please see if we can reuse code; there are large chunks of code that are 
the same in mapred.FOC and mapreduce.lib.output.FOC.
 - A test which recovers multiple tasks would have caught this issue. Can you 
add one in this patch?
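
To make the first comment concrete, here is a minimal sketch that checks the 
two forms of the condition against each other, reading the first form as a 
single negation of (iAmAMap && numReduces == 0). The class and method names are 
hypothetical and are not taken from the patch or from RecoveryService; the 
sketch also assumes every task is either a map or a reduce, so iAmReduce is 
simply !iAmAMap, and that numReduces is never negative.

```java
// Hypothetical sketch: not code from the patch or from RecoveryService.
final class RecoveryConditionSketch {

    // Negated form of the condition, as read from the review comment.
    static boolean negatedForm(boolean iAmAMap, int numReduces) {
        return !(iAmAMap && numReduces == 0);
    }

    // "Simpler" form. Assumes a task is either a map or a reduce,
    // so iAmReduce is derived as !iAmAMap.
    static boolean simplerForm(boolean iAmAMap, int numReduces) {
        boolean iAmReduce = !iAmAMap;
        return iAmReduce || numReduces > 0;
    }

    public static void main(String[] args) {
        boolean[] taskKinds = {true, false};   // true = map, false = reduce
        int[] reduceCounts = {0, 1, 680};      // map-only job and jobs with reduces

        for (boolean iAmAMap : taskKinds) {
            for (int numReduces : reduceCounts) {
                boolean a = negatedForm(iAmAMap, numReduces);
                boolean b = simplerForm(iAmAMap, numReduces);
                // By De Morgan's law the two forms must agree for numReduces >= 0.
                if (a != b) {
                    throw new AssertionError("forms disagree for iAmAMap="
                        + iAmAMap + ", numReduces=" + numReduces);
                }
                System.out.println("iAmAMap=" + iAmAMap
                    + " numReduces=" + numReduces + " -> " + a);
            }
        }
    }
}
```

The only combination for which both forms are false is a map task in a job 
with zero reduces, which is why a map-only case belongs in the test matrix 
next to the maps + reduces case.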
                
> AppMaster recovery for Medium to large jobs take long time
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-3711
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: mrv2
>    Affects Versions: 0.23.0, 0.24.0
>            Reporter: Siddharth Seth
>            Assignee: Robert Joseph Evans
>            Priority: Blocker
>             Fix For: 0.23.1
>
>         Attachments: MR-3711.txt, MR-3711.txt
>
>
> Reported by [~karams]
> yarn.resourcemanager.am.max-retries=2
> Ran test cases with a sort job on 350 scale having 16800 maps and 680 reduces:
> 1. After 70 secs of job submission, the AM was killed using kill -9; around 
> 3900 maps were completed and 680 reduces were
> scheduled. A second AM was started. The job completed in 980 secs. The AM took 
> very little time to recover.
> 2. After 150 secs of job submission, the AM was killed using kill -9; around 
> 90% of maps were completed and 680 reduces were
> scheduled. A second AM was started. The job completed in 1000 secs. The AM 
> recovered.
> 3. After 150 secs of job submission, the AM was killed using kill -9; almost 
> all maps were completed and only the 680 reduces
> were running. Recovery was too slow; the AM was still recovering after 1 hr 40 
> mins when I killed the run.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
