[jira] [Updated] (MAPREDUCE-4443) MR AM and job history server should be resilient to jobs that exceed counter limits

Vinod Kumar Vavilapalli (JIRA) Thu, 21 Feb 2013 16:52:14 -0800

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinod Kumar Vavilapalli updated MAPREDUCE-4443:
-----------------------------------------------

    Labels: usability  (was: )
    
> MR AM and job history server should be resilient to jobs that exceed counter 
> limits 
> ------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4443
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4443
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.0.0-alpha
>            Reporter: Rahul Jain
>              Labels: usability
>         Attachments: am_failed_counter_limits.txt
>
>
> We saw this problem migrating applications to MapReduceV2:
> Our applications use hadoop counters extensively (1000+ counters for certain 
> jobs). While this may not be one of recommended best practices in hadoop, the 
> real issue here is reliability of the framework when applications exceed 
> counter limits.
> The hadoop servers (yarn, history server) were originally brought up with 
> mapreduce.job.counters.max=1000 under core-site.xml
> We then ran map-reduce job under an application using its own job specific 
> overrides, with  mapreduce.job.counters.max=10000
> All the tasks for the job finished successfully; however the overall job 
> still failed due to AM encountering exceptions as:
> {code}
> 2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks
> : 712012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher threa
> dorg.apache.hadoop.mapreduce.counters.LimitExceededException: Too many 
> counters: 1001 max=1000
>         at 
> org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)     
>    at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
>         at 
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
>         at 
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
>         at 
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
>         at 
> org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
>         at 
> org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)   
>      at java.lang.Thread.run(Thread.java:662)
> 2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] 
> org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..2012-07-12 
> 17:31:43,503 INFO [Thread-1] org.apache.had
> {code}
> The overall job failed, and the job history wasn't accessible either at the 
> end of the job (didn't show up in job history server).
> We were able to workaround the issue by changing to higher limits in 
> core-site.xml and restarting yarn servers. However that forced us to increase 
> the counters global limit to be as high as possible use by any individual 
> application, which is hard to predict.
> The original job then succeeded with new global limits. 
> However, since we didn't restart the job history server, it was unable to 
> display job history page for the successful job altogether as it still hit 
> counter exceeded exception. Restart of job history server finally got the 
> application available under job history.
> I'll also attach AM logs to help debug the issue 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (MAPREDUCE-4443) MR AM and job history server should be resilient to jobs that exceed counter limits

Reply via email to