Rahul Jain created MAPREDUCE-4443:
-------------------------------------

             Summary: Yarn framework components (AM, job history server) should 
be resilient to applications exceeding counter limits 
                 Key: MAPREDUCE-4443
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4443
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.0.0-alpha
            Reporter: Rahul Jain


We saw this problem while migrating applications to MapReduce v2:

Our applications use Hadoop counters extensively (1000+ counters for certain 
jobs). While this may not be a recommended best practice in Hadoop, the real 
issue here is the reliability of the framework when applications exceed 
counter limits.

The Hadoop servers (YARN, job history server) were originally brought up with 
mapreduce.job.counters.max=1000 in core-site.xml.

We then ran a MapReduce job from an application that uses its own job-specific 
overrides, with mapreduce.job.counters.max=10000.

All the tasks for the job finished successfully; however, the overall job still 
failed because the AM hit the following exception:

{code}
2012-07-12 17:31:43,485 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: Num completed Tasks: 71
2012-07-12 17:31:43,502 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 1001 max=1000
        at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:58)
        at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:65)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:77)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:94)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:105)
        at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:202)
        at org.apache.hadoop.mapreduce.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:337)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.constructFinalFullcounters(JobImpl.java:1212)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.mayBeConstructFinalFullCounters(JobImpl.java:1198)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.createJobFinishedEvent(JobImpl.java:1179)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.logJobHistoryFinishedEvent(JobImpl.java:711)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.checkJobCompleteSuccess(JobImpl.java:737)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.checkJobForCompletion(JobImpl.java:1360)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1340)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$TaskCompletedTransition.transition(JobImpl.java:1323)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:380)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:666)
        at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:113)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:890)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:886)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:125)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:74)
        at java.lang.Thread.run(Thread.java:662)
2012-07-12 17:31:43,502 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
2012-07-12 17:31:43,503 INFO [Thread-1] org.apache.had
{code}

The overall job failed, and the job history wasn't accessible either at the end 
of the job (didn't show up in job history server).
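
To make the mismatch concrete, here is a minimal, self-contained sketch of 
what appears to happen. It does not use Hadoop classes: BoundedCounters and 
aggregate are hypothetical stand-ins, and the assumption (based on the stack 
trace above) is that the AM aggregates counters against the limit it was 
started with rather than the job's override.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CounterLimitMismatchDemo {

    /** Hypothetical stand-in for a limit-checked counter set; not Hadoop's class. */
    static final class BoundedCounters {
        private final int max;
        private final Map<String, Long> counters = new LinkedHashMap<>();

        BoundedCounters(int max) {
            this.max = max;
        }

        void increment(String name, long delta) {
            // Reject a new counter name once the limit would be exceeded.
            if (!counters.containsKey(name) && counters.size() + 1 > max) {
                throw new IllegalStateException(
                    "Too many counters: " + (counters.size() + 1) + " max=" + max);
            }
            counters.merge(name, delta, Long::sum);
        }
    }

    /** Adds `count` distinct counters; returns null on success, else the error message. */
    public static String aggregate(int count, int max) {
        BoundedCounters c = new BoundedCounters(max);
        try {
            for (int i = 0; i < count; i++) {
                c.increment("counter-" + i, 1L);
            }
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        int serverMax = 1000; // limit the servers were started with
        int jobMax = 10000;   // job-specific override
        int used = 1001;      // distinct counters produced by the job
        System.out.println("checked against job limit:    " + aggregate(used, jobMax));
        System.out.println("checked against server limit: " + aggregate(used, serverMax));
    }
}
```

In this sketch the same 1001 counters pass under the job's limit and fail 
under the server's limit, which mirrors tasks succeeding while the final 
aggregation in the AM fails.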

We were able to work around the issue by raising the limit in core-site.xml 
and restarting the YARN servers. However, that forced us to set the global 
counter limit as high as the largest number of counters any individual 
application might use, which is hard to predict.

The original job then succeeded with new global limits. 

However, since we hadn't restarted the job history server, it was unable to 
display the job history page for the successful job at all, as it still hit 
the counter limit exceeded exception. Restarting the job history server 
finally made the application available under job history.
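
One possible shape for the resilience requested here, sketched without Hadoop 
classes: merge per-task counters up to the limit and record how many updates 
were dropped, instead of throwing from the completion path. The names 
LenientCounterAggregation, mergeWithLimit and Result are hypothetical, and 
this is only an illustration of the idea, not a proposed patch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LenientCounterAggregation {

    /** Counters that fit under the limit, plus the number of updates that did not. */
    public static final class Result {
        public final Map<String, Long> kept;
        public final int droppedUpdates;

        Result(Map<String, Long> kept, int droppedUpdates) {
            this.kept = kept;
            this.droppedUpdates = droppedUpdates;
        }
    }

    /** Merges per-task counters, keeping at most `max` distinct names and never throwing. */
    public static Result mergeWithLimit(List<Map<String, Long>> perTask, int max) {
        Map<String, Long> kept = new LinkedHashMap<>();
        int dropped = 0;
        for (Map<String, Long> task : perTask) {
            for (Map.Entry<String, Long> e : task.entrySet()) {
                if (kept.containsKey(e.getKey()) || kept.size() < max) {
                    // Known counter, or room for a new one: accumulate the value.
                    kept.merge(e.getKey(), e.getValue(), Long::sum);
                } else {
                    // Over the limit: skip this update but keep going.
                    dropped++;
                }
            }
        }
        return new Result(kept, dropped);
    }

    public static void main(String[] args) {
        List<Map<String, Long>> perTask = List.of(
            Map.of("a", 1L, "b", 2L),
            Map.of("a", 3L, "c", 4L));
        Result r = mergeWithLimit(perTask, 2);
        System.out.println("kept=" + r.kept.size() + " droppedUpdates=" + r.droppedUpdates);
    }
}
```

With a design like this, the job could still be marked successful and its 
history written, with the truncation reported as a warning rather than a 
fatal error in the dispatcher thread.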

I'll also attach the AM logs to help debug the issue.




        
