[ 
https://issues.apache.org/jira/browse/TEZ-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15040817#comment-15040817
 ] 

Hitesh Shah commented on TEZ-2968:
----------------------------------

[~zjffdu] Addressed the duplicate log comment as well as the AMWebController 
comment (filed an additional follow-up for it, as changing it now may break 
the UI).

As for TaskImpl and TAImpl, I had looked at those and decided not to change 
them in this jira, as doing so requires a lot of changes in the state machine 
to move the task attempt or task to a clean failed state. For example, if a 
task attempt succeeds but with counters exceeded, this should fail the task 
even though only a single attempt ran. I would prefer to tackle the counter 
limits issue in a follow-up, as part of a more major change in how it is 
handled. Any concerns with this? For now, this patch will cause the DAG to go 
into an error state instead of crashing the AM. This is limited to the case 
where the task attempt generates exactly the max limit of counters and the 
additional counter in the AM makes the total count go over the limit. 
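
To make that edge case concrete, here is a rough standalone sketch. It is not 
Tez code: CounterLimitSketch, Counters, LimitExceededException, DagState, 
completeDag, attemptWith and the AM_SIDE_COUNTER name are all hypothetical, and 
the limit of 3 is arbitrary. It only shows the idea of catching the limit 
violation at completion and mapping it to an error state rather than letting 
it propagate.

```java
import java.util.HashMap;
import java.util.Map;

public class CounterLimitSketch {
    /** Hypothetical stand-in for a counter-limit violation; not the Tez class. */
    static class LimitExceededException extends RuntimeException {
        LimitExceededException(String msg) { super(msg); }
    }

    /** Hypothetical counter set that rejects new counter names past a fixed limit. */
    static class Counters {
        private final int maxCounters;
        private final Map<String, Long> values = new HashMap<>();

        Counters(int maxCounters) { this.maxCounters = maxCounters; }

        void increment(String name, long delta) {
            // Only a brand-new counter name can push the count over the limit.
            if (!values.containsKey(name) && values.size() >= maxCounters) {
                throw new LimitExceededException(
                    "Too many counters: " + (values.size() + 1) + " max=" + maxCounters);
            }
            values.merge(name, delta, Long::sum);
        }

        void aggregate(Counters other) {
            for (Map.Entry<String, Long> e : other.values.entrySet()) {
                increment(e.getKey(), e.getValue());
            }
        }
    }

    enum DagState { SUCCEEDED, ERROR }

    /** Builds attempt-side counters with n distinct names (assumes n <= limit). */
    static Counters attemptWith(int n, int limit) {
        Counters c = new Counters(limit);
        for (int i = 0; i < n; i++) {
            c.increment("ATTEMPT_COUNTER_" + i, 1);
        }
        return c;
    }

    /** On completion, a limit violation becomes an error state instead of escaping. */
    static DagState completeDag(Counters attemptCounters, int limit) {
        Counters dagCounters = new Counters(limit);
        try {
            dagCounters.aggregate(attemptCounters);
            // The one extra counter added on the AM side (hypothetical name).
            dagCounters.increment("AM_SIDE_COUNTER", 1);
            return DagState.SUCCEEDED;
        } catch (LimitExceededException e) {
            // Without this catch the exception would reach the caller.
            return DagState.ERROR;
        }
    }

    public static void main(String[] args) {
        int limit = 3;
        System.out.println(completeDag(attemptWith(limit - 1, limit), limit)); // SUCCEEDED
        System.out.println(completeDag(attemptWith(limit, limit), limit));     // ERROR
    }
}
```

In the second call the attempt itself stays within the limit, so it looks 
successful on its own; only the extra AM-side counter tips the total over.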

[~bikassaha] - the different approaches are mainly due to the differences 
between DAGImpl and VertexImpl. The code is consistent within each class, I 
believe. Let me know if I missed something. 

 




> Counter limits exception causes AM to crash 
> --------------------------------------------
>
>                 Key: TEZ-2968
>                 URL: https://issues.apache.org/jira/browse/TEZ-2968
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>            Priority: Critical
>         Attachments: TEZ-2968.1.wip.patch, TEZ-2968.2.patch, TEZ-2968.3.patch
>
>
> On vertex or dag completion, the counter limits exception propagates to the 
> Dispatcher and causes the AM to die. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
