[
https://issues.apache.org/jira/browse/TEZ-2262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14485181#comment-14485181
]
Jeff Zhang commented on TEZ-2262:
---------------------------------
I found that exceeding the counter limit in TezChild fails the task attempt.
So if we don't fail the AM when the counter limit is exceeded on the AM, would
that cause an inconsistency between the AM and TezChild? Of course, we could
also stop failing the TaskAttempt when the limit is exceeded in TezChild, but
that would require a lot of changes, because counters are used everywhere
(I/P/O).
I'm not sure how MapReduce handles exceeding the counter limit; maybe we can
borrow some ideas from MapReduce. [~hitesh] Thoughts?
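To make the "don't fail" alternative concrete, here is a minimal sketch of a counter store that drops new counters once the limit is reached and records a diagnostic instead of throwing. The class LenientCounters and all of its methods are hypothetical names for illustration only; this is not the existing Limits/AbstractCounterGroup code, and a real change would have to behave the same way on both the AM and TezChild.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Tez code): a counter store that stops accepting
 * new counters at the limit instead of throwing LimitExceededException.
 */
class LenientCounters {
    private final int maxCounters;
    private final Map<String, Long> counters = new LinkedHashMap<>();
    private long droppedUpdates = 0;

    LenientCounters(int maxCounters) {
        this.maxCounters = maxCounters;
    }

    /**
     * Adds delta to the named counter. Existing counters are always updated.
     * A new counter is created only while the limit has not been reached;
     * otherwise the update is dropped and false is returned.
     */
    boolean increment(String name, long delta) {
        if (!counters.containsKey(name) && counters.size() >= maxCounters) {
            droppedUpdates++;
            return false;
        }
        counters.merge(name, delta, Long::sum);
        return true;
    }

    /** Current value of the counter, or 0 if it was never created. */
    long get(String name) {
        return counters.getOrDefault(name, 0L);
    }

    int size() {
        return counters.size();
    }

    /** Warning text for the DAG diagnostics, or null if nothing was dropped. */
    String diagnostics() {
        if (droppedUpdates == 0) {
            return null;
        }
        return "Counter limit of " + maxCounters + " reached; "
            + droppedUpdates + " update(s) to new counters were dropped";
    }
}
```

With this shape, the caller checks diagnostics() when the DAG or task finishes and attaches the warning, rather than catching an exception in the dispatcher thread.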
> DAG/Tasks should not fail if counter limits are exceeded.
> ---------------------------------------------------------
>
> Key: TEZ-2262
> URL: https://issues.apache.org/jira/browse/TEZ-2262
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.0
> Reporter: Mostafa Mokhtar
>
> Running TPC-DS Q64 failed due to exceeding the max number of counters.
> DAG should succeed and include a warning in the diagnostics stating that the
> error got truncated.
> {code}
> 18043560327-2015-04-01 16:23:08,509 INFO [AsyncDispatcher event handler]
> impl.DAGImpl: No output committers for vertex: Reducer 9
> 18043560445-2015-04-01 16:23:08,857 FATAL [AsyncDispatcher event handler]
> event.AsyncDispatcher: Error in dispatcher thread
> 18043560557:org.apache.tez.common.counters.LimitExceededException: Too many
> counters: 1201 max=1200
> 18043560645- at
> org.apache.tez.common.counters.Limits.checkCounters(Limits.java:87)
> 18043560717- at
> org.apache.tez.common.counters.Limits.incrCounters(Limits.java:94)
> 18043560788- at
> org.apache.tez.common.counters.AbstractCounterGroup.addCounter(AbstractCounterGroup.java:75)
> 18043560885- at
> org.apache.tez.common.counters.AbstractCounterGroup.addCounterImpl(AbstractCounterGroup.java:92)
> 18043560986- at
> org.apache.tez.common.counters.AbstractCounterGroup.findCounter(AbstractCounterGroup.java:103)
> 18043561085- at
> org.apache.tez.common.counters.AbstractCounterGroup.incrAllCounters(AbstractCounterGroup.java:198)
> 18043561188- at
> org.apache.tez.common.counters.AbstractCounters.incrAllCounters(AbstractCounters.java:363)
> 18043561283- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.incrTaskCounters(DAGImpl.java:598)
> 18043561362- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.getAllCounters(DAGImpl.java:588)
> 18043561439- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.logJobHistoryFinishedEvent(DAGImpl.java:994)
> 18043561528- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.finished(DAGImpl.java:1135)
> 18043561600- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.checkDAGForCompletion(DAGImpl.java:1048)
> 18043561685- at
> org.apache.tez.dag.app.dag.impl.DAGImpl$VertexCompletedTransition.transition(DAGImpl.java:1708)
> 18043561785- at
> org.apache.tez.dag.app.dag.impl.DAGImpl$VertexCompletedTransition.transition(DAGImpl.java:1665)
> 18043561885- at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> 18043562001- at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 18043562097- at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 18043562190- at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 18043562307- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:944)
> 18043562376- at
> org.apache.tez.dag.app.dag.impl.DAGImpl.handle(DAGImpl.java:126)
> 18043562445- at
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1686)
> 18043562535- at
> org.apache.tez.dag.app.DAGAppMaster$DagEventDispatcher.handle(DAGAppMaster.java:1677)
> 18043562625- at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 18043562709- at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 18043562790- at java.lang.Thread.run(Thread.java:745)
> 18043562832-2015-04-01 16:23:08,882 INFO [AsyncDispatcher event handler]
> event.AsyncDispatcher: Exiting, bbye..
> 18043562932-2015-04-01 16:23:08,885 INFO [Thread-1] app.DAGAppMaster:
> DAGAppMasterShutdownHook invoked
> 18043563023-2015-04-01 16:23:08,885 INFO [Thread-1] app.DAGAppMaster:
> DAGAppMaster received a signal. Signaling TaskScheduler
> 18043563137-2015-04-01 16:23:08,885 INFO [Thread-1]
> rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was :
> true
> {code}