Jonathan Eagles commented on TEZ-3817:

[~kshukla], couple of small things I would like to see. The overall approach 
seems in line with Jason's original review. I think that if this JIRA were 
aimed at refactoring the shutdown system, I would have suggested a 
try/catch/finally design approach to commonize the event sending logic and 
error handling. But given the scope of the JIRA and the current messiness of 
the error shutdown system (e.g. why do we fail a suceeded DAG if recovery fails 
to log), this seems like a good approach.

- Let's rename recoveryError since it is used in a more general way
- Let's log the exception in all cases (Not just catch-all non-IOException 
cases) so we get the stack trace of what happened in every case.
- Let's change the LOG.info call to LOG.warn (or maybe even ERROR?) to 
emphasize in the log where the bad things are happening
- The tests look like they are passing, but it looks like it is throwing an 
NPE. Let's keep the test going through the logUnsuccessful History path, but we 
need to ensure 1) path was taken and 2) final state is correct

> DAGs can hang after more than one uncaught Exception during doTransition.
> -------------------------------------------------------------------------
>                 Key: TEZ-3817
>                 URL: https://issues.apache.org/jira/browse/TEZ-3817
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1, 0.9.0
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>            Priority: Major
>         Attachments: TEZ-3817.001.patch, TEZ-3817.002.patch, 
> TEZ-3817.003.patch, TEZ-3817.test.patch
> A Tez DAG can hang in the last "sane" state if the 
> statemachine.doTransition() throws a runtime exception more than once. The 
> transition for the Error state itself throws an exception, the DAG hangs. 

This message was sent by Atlassian JIRA

Reply via email to