[
https://issues.apache.org/jira/browse/TEZ-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15040817#comment-15040817
]
Hitesh Shah commented on TEZ-2968:
----------------------------------
[~zjffdu] Addressed the duplicate log comment as well as AMWebController (filed
an additional follow-up, as changing it now may break the UI).
As for TaskImpl and TAImpl, I looked at those and decided not to change them in
this jira, as moving the task attempt or task to a clean failed state requires
a lot of changes in the state machine. For example, if a task attempt succeeds
but exceeds the counter limit, the task should fail even though only a single
attempt ran. I would prefer to tackle the counter limits issue in a follow-up,
as part of a more major change to how this is handled. Any concerns with this?
For now, this patch will cause the DAG to go into an error state rather than
crash the AM. This is limited to the case where the task attempt generates
exactly the maximum number of counters allowed and the additional counter in
the AM pushes the total count over the limit.
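To illustrate the edge case, here is a minimal sketch. It is not the patch
itself and does not use the real Tez classes: Counters,
TooManyCountersException, completeDag and AM_SIDE_COUNTER are hypothetical
stand-ins. It assumes the limit applies to the number of distinct counter
names, and shows the attempt sitting exactly at the limit, the AM adding one
more counter, and the failure being turned into an error state instead of
propagating.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names are the actual Tez classes.
class CounterLimitSketch {

    /** Stand-in for the exception raised when the counter limit is exceeded. */
    static class TooManyCountersException extends RuntimeException {
        TooManyCountersException(String msg) { super(msg); }
    }

    /** Minimal counter set enforcing a maximum number of distinct counters. */
    static class Counters {
        private final int maxCounters;
        private final Set<String> names = new HashSet<>();

        Counters(int maxCounters) { this.maxCounters = maxCounters; }

        void add(String name) {
            // Only a new name can push the set over the limit.
            if (!names.contains(name) && names.size() >= maxCounters) {
                throw new TooManyCountersException(
                    "Too many counters: " + (names.size() + 1) + " max=" + maxCounters);
            }
            names.add(name);
        }

        void addAll(Counters other) {
            for (String n : other.names) {
                add(n);
            }
        }
    }

    enum State { SUCCEEDED, ERROR }

    /** Aggregates the attempt's counters plus one AM-side counter on completion. */
    static State completeDag(Counters attemptCounters, int limit) {
        Counters dagCounters = new Counters(limit);
        try {
            dagCounters.addAll(attemptCounters);
            dagCounters.add("AM_SIDE_COUNTER"); // the additional counter added in the AM
            return State.SUCCEEDED;
        } catch (TooManyCountersException e) {
            // Contain the failure: report an error state instead of rethrowing
            // to the caller (the dispatcher, in the scenario described above).
            return State.ERROR;
        }
    }

    public static void main(String[] args) {
        int limit = 3;

        Counters atLimit = new Counters(limit);
        atLimit.add("A");
        atLimit.add("B");
        atLimit.add("C");
        System.out.println(completeDag(atLimit, limit));    // prints ERROR

        Counters belowLimit = new Counters(limit);
        belowLimit.add("A");
        System.out.println(completeDag(belowLimit, limit)); // prints SUCCEEDED
    }
}
```

In this sketch the attempt itself never exceeds the limit; only the
aggregation step does, which is why the failure surfaces at completion time
rather than while the attempt is running.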
[~bikassaha] - the different approaches are mainly due to the differences
between DAGImpl and VertexImpl. The code is consistent within each class, I
believe. Let me know if I missed something.
> Counter limits exception causes AM to crash
> --------------------------------------------
>
> Key: TEZ-2968
> URL: https://issues.apache.org/jira/browse/TEZ-2968
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Hitesh Shah
> Priority: Critical
> Attachments: TEZ-2968.1.wip.patch, TEZ-2968.2.patch, TEZ-2968.3.patch
>
>
> On vertex or dag completion, the counter limits exception propagates to the
> Dispatcher and causes the AM to die.