AppMaster recovery for medium to large jobs takes a long time
-------------------------------------------------------------

                 Key: MAPREDUCE-3711
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3711
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
    Affects Versions: 0.23.0
            Reporter: Siddharth Seth
            Priority: Critical


Reported by [~karams]

yarn.resourcemanager.am.max-retries=2
Ran test cases with a sort job on 350 scale, with 16800 maps and 680 reduces:
1. The AM was killed with kill -9 70 secs after job submission, when around
3900 maps had completed and 680 reduces were scheduled. A second AM was
started and the job completed in 980 secs. The AM took very little time to
recover.
2. The AM was killed with kill -9 150 secs after job submission, when around
90% of maps had completed and 680 reduces were scheduled. A second AM was
started and the job completed in 1000 secs. The AM recovered.
3. The AM was killed with kill -9 150 secs after job submission, when almost
all maps had completed and only the 680 reduces were running. Recovery was
too slow: the AM was still recovering after 1 hr 40 mins, when I killed the
run.
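
The retry limit quoted at the top of this report is a ResourceManager
setting. As a minimal sketch, assuming it is set in yarn-site.xml (the file
placement is an assumption, it is not stated in the report), the entry could
look like this:

```xml
<!-- Sketch only: assumes the property is defined in yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.am.max-retries</name>
    <value>2</value>
  </property>
</configuration>
```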
