[
https://issues.apache.org/jira/browse/MAPREDUCE-4443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kumar Vavilapalli updated MAPREDUCE-4443:
-----------------------------------------------
Labels: usability (was: )
> MR AM and job history server should be resilient to jobs that exceed counter
> limits
> ------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4443
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4443
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Affects Versions: 2.0.0-alpha
> Reporter: Rahul Jain
> Labels: usability
> Attachments: am_failed_counter_limits.txt
>
>
> We saw this problem while migrating applications to MapReduceV2:
> Our applications use Hadoop counters extensively (1000+ counters for certain
> jobs). While this may not be a recommended best practice in Hadoop, the
> real issue here is the reliability of the framework when applications exceed
> counter limits.
> The Hadoop servers (YARN, history server) were originally brought up with
> mapreduce.job.counters.max=1000 in core-site.xml.
> We then ran a MapReduce job under an application using its own job-specific
> overrides, with mapreduce.job.counters.max=10000.
> All the tasks for the job finished successfully; however, the overall job
> still failed because the AM encountered exceptions such as:
> {code}
> 2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 71
> 2012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 1001 max=1000
>     at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)
>     at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
>     at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
>     at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
>     at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
>     at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
>     at java.lang.Thread.run(Thread.java:662)
> 2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> 2012-07-12 17:31:43,503 INFO [Thread-1] org.apache.had
> {code}
> The overall job failed, and the job history wasn't accessible at the
> end of the job either (it didn't show up in the job history server).
> We were able to work around the issue by raising the limits in
> core-site.xml and restarting the YARN servers. However, that forced us to
> raise the global counter limit to the highest value that any individual
> application might use, which is hard to predict.
> The original job then succeeded with the new global limits.
> However, since we didn't restart the job history server, it was unable to
> display the job history page for the successful job at all, as it still hit
> the counter-limit-exceeded exception. Restarting the job history server
> finally made the application available under job history.
> I'll also attach AM logs to help debug the issue.
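
The failure mode in the report can be illustrated with a small, self-contained sketch. It does not use Hadoop, and it is not the fix adopted for this issue: `BoundedCounters`, `aggregateStrict`, `aggregateTolerant`, and the local `LimitExceededException` are hypothetical stand-ins. The sketch assumes the aggregating process enforces its own limit (1000) while tasks report more counters than that, and contrasts a merge that aborts on the first counter over the limit with one that keeps what fits and reports how many were dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterLimitSketch {

    /** Hypothetical stand-in for Hadoop's LimitExceededException. */
    static class LimitExceededException extends RuntimeException {
        LimitExceededException(String message) {
            super(message);
        }
    }

    /** Hypothetical counter store that enforces a fixed maximum number of counters. */
    static class BoundedCounters {
        private final int max;
        private final Map<String, Long> values = new LinkedHashMap<>();

        BoundedCounters(int max) {
            this.max = max;
        }

        /** Adds delta to the named counter; creating a counter beyond the limit throws. */
        void increment(String name, long delta) {
            if (!values.containsKey(name) && values.size() + 1 > max) {
                throw new LimitExceededException(
                        "Too many counters: " + (values.size() + 1) + " max=" + max);
            }
            values.merge(name, delta, Long::sum);
        }

        int size() {
            return values.size();
        }
    }

    /** Strict merge: the first counter over the limit aborts the whole aggregation. */
    static void aggregateStrict(BoundedCounters total, Map<String, Long> taskCounters) {
        taskCounters.forEach(total::increment);
    }

    /** Tolerant merge: keeps the counters that fit and returns how many were dropped. */
    static int aggregateTolerant(BoundedCounters total, Map<String, Long> taskCounters) {
        int dropped = 0;
        for (Map.Entry<String, Long> entry : taskCounters.entrySet()) {
            try {
                total.increment(entry.getKey(), entry.getValue());
            } catch (LimitExceededException e) {
                dropped++; // a real system would presumably log this rather than fail the job
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Tasks ran with a higher limit and report 1001 distinct counters.
        Map<String, Long> taskCounters = new LinkedHashMap<>();
        for (int i = 0; i < 1001; i++) {
            taskCounters.put("counter-" + i, 1L);
        }

        try {
            aggregateStrict(new BoundedCounters(1000), taskCounters);
        } catch (LimitExceededException e) {
            System.out.println("strict merge failed: " + e.getMessage());
        }

        BoundedCounters total = new BoundedCounters(1000);
        int dropped = aggregateTolerant(total, taskCounters);
        System.out.println("tolerant merge kept " + total.size() + ", dropped " + dropped);
    }
}
```

Whether dropping counters, truncating them, or honoring the job's own limit is the right behavior is a design question for the framework; the sketch only shows that the aggregation step does not have to take the whole job down.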