[
https://issues.apache.org/jira/browse/TEZ-3817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428903#comment-16428903
]
Jonathan Eagles commented on TEZ-3817:
--------------------------------------
[~kshukla], couple of small things I would like to see. The overall approach
seems in line with Jason's original review. I think that if this JIRA were
aimed at refactoring the shutdown system, I would have suggested a
try/catch/finally design approach to commonize the event sending logic and
error handling. But given the scope of the JIRA and the current messiness of
the error shutdown system (e.g. why do we fail a suceeded DAG if recovery fails
to log), this seems like a good approach.
- Let's rename recoveryError since it is used in a more general way
- Let's log the exception in all cases (Not just catch-all non-IOException
cases) so we get the stack trace of what happened in every case.
- Let's change the LOG.info call to LOG.warn (or maybe even ERROR?) to
emphasize in the log where the bad things are happening
- The tests look like they are passing, but it looks like it is throwing an
NPE. Let's keep the test going through the logUnsuccessful History path, but we
need to ensure 1) path was taken and 2) final state is correct
> DAGs can hang after more than one uncaught Exception during doTransition.
> -------------------------------------------------------------------------
>
> Key: TEZ-3817
> URL: https://issues.apache.org/jira/browse/TEZ-3817
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1, 0.9.0
> Reporter: Kuhu Shukla
> Assignee: Kuhu Shukla
> Priority: Major
> Attachments: TEZ-3817.001.patch, TEZ-3817.002.patch,
> TEZ-3817.003.patch, TEZ-3817.test.patch
>
>
> A Tez DAG can hang in the last "sane" state if the
> statemachine.doTransition() throws a runtime exception more than once. The
> transition for the Error state itself throws an exception, the DAG hangs.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)