[ 
https://issues.apache.org/jira/browse/TEZ-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15257001#comment-15257001
 ] 

Eric Badger commented on TEZ-3213:
----------------------------------

[~hitesh], I don't believe that this cascading failure can occur in 
DAGImpl, TaskImpl, or TaskAttemptImpl. It occurred in VertexImpl because the 
transition from the RECOVERING state to the ERROR state on a 
V_INTERNAL_ERROR event was missing. Every other state in VertexImpl already 
handles a V_INTERNAL_ERROR event. 

Before the transition was added, a failure during recovery would trigger a 
V_INTERNAL_ERROR event. The state machine had no transition defined for that 
event in the RECOVERING state, so it rejected the event as an invalid 
transition. Handling that invalid transition created another 
V_INTERNAL_ERROR event, which was rejected for the same reason. The cycle 
repeated indefinitely, which is what caused the failure message looping. 
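
To make the loop concrete, here is a minimal sketch of the behavior described 
above. It is not the VertexImpl code: the class name, the transition table, 
the RUNNING state, and the MAX_EVENTS cap are assumptions made only so the 
demo is self-contained and terminates. Only the RECOVERING and ERROR states 
and the V_INTERNAL_ERROR event come from the issue.

```java
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only; names other than RECOVERING, ERROR and
// V_INTERNAL_ERROR are hypothetical and do not mirror the Tez sources.
class RecoveryLoopSketch {
    enum State { RECOVERING, RUNNING, ERROR }
    enum Event { V_INTERNAL_ERROR }

    // Assumed cap so the demo stops; the loop in the issue had no such bound.
    static final int MAX_EVENTS = 5;

    // Builds a toy transition table. Without the fix, RECOVERING has no
    // entry for V_INTERNAL_ERROR, which is the gap described above.
    static Map<State, Map<Event, State>> buildTable(boolean withFix) {
        Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
        for (State s : State.values()) {
            table.put(s, new EnumMap<>(Event.class));
        }
        table.get(State.RUNNING).put(Event.V_INTERNAL_ERROR, State.ERROR);
        table.get(State.ERROR).put(Event.V_INTERNAL_ERROR, State.ERROR);
        if (withFix) {
            table.get(State.RECOVERING).put(Event.V_INTERNAL_ERROR, State.ERROR);
        }
        return table;
    }

    // Delivers one initial V_INTERNAL_ERROR while in RECOVERING and returns
    // how many invalid transitions were seen before the queue drained or the
    // cap was reached.
    static int countInvalidTransitions(boolean withFix) {
        Map<State, Map<Event, State>> table = buildTable(withFix);
        State state = State.RECOVERING;
        Queue<Event> queue = new ArrayDeque<>();
        queue.add(Event.V_INTERNAL_ERROR); // the original uncaught failure
        int invalid = 0;
        int processed = 0;
        while (!queue.isEmpty() && processed < MAX_EVENTS) {
            Event event = queue.remove();
            processed++;
            State next = table.get(state).get(event);
            if (next == null) {
                // Invalid transition: the handler reports it by raising
                // another V_INTERNAL_ERROR, which is invalid again.
                invalid++;
                queue.add(Event.V_INTERNAL_ERROR);
            } else {
                state = next;
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + countInvalidTransitions(false));
        System.out.println("with fix:    " + countInvalidTransitions(true));
    }
}
```

Without the RECOVERING entry, every delivered event is rejected and re-raised, 
so the count only stops at the artificial cap. With the entry in place, the 
first event moves the machine to ERROR and the queue drains.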

DAGImpl handles the INTERNAL_ERROR event in all of its states, so there is no 
issue there. And from what I can tell, this sort of internal error event does 
not exist in TaskImpl or TaskAttemptImpl. So I think all of our bases are 
covered. 

> Uncaught exception during vertex recovery leads to invalid state transition 
> loop
> --------------------------------------------------------------------------------
>
>                 Key: TEZ-3213
>                 URL: https://issues.apache.org/jira/browse/TEZ-3213
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>            Assignee: Eric Badger
>         Attachments: TEZ-3213-b0.7.001.patch
>
>
> If an uncaught exception occurs during a state transition from the RECOVERING 
> vertex then V_INTERNAL_ERROR will be delivered to the state machine, but that 
> event is not handled in the RECOVERING state.  That in turn causes a 
> V_INTERNAL_ERROR event to be delivered to the state machine, and it loops 
> logging the invalid transitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
