[ 
https://issues.apache.org/jira/browse/TEZ-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946100#comment-13946100
 ] 

Siddharth Seth commented on TEZ-973:
------------------------------------

{code}
if (foundPreviousAttempt == -1) {
      LOG.info("Falling back to first attempt as no other recovered attempts"
          + " found");
      foundPreviousAttempt = 1;
    }
{code}
Is this valid? This directory would already have been checked during
iteration. Or is this for the case where an AM ran without actually executing
any DAGs?

{code}
 .addTransition(DAGState.NEW, DAGState.KILLED,
              DAGEventType.DAG_KILL,
              new KillNewJobTransition())
{code}
Does this need to have ERROR as a potential target state? Similarly for DAG_KILL on INITED.

Also at 
{code}
.addTransition(DAGState.RUNNING,
              EnumSet.of(DAGState.RUNNING, DAGState.SUCCEEDED,
                  DAGState.TERMINATING, DAGState.FAILED),
              DAGEventType.DAG_VERTEX_COMPLETED,
{code}

Should the Vertex be put into ERROR state if persisting VERTEX_FINISHED fails?

{code}
public static String RECOVERY_FATAL_OCCURRED_DIR
{code}
Should be final.
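
To illustrate the point: without final, any class could reassign the field at runtime and change which directory is treated as the fatal-error flag. A minimal sketch of the suggested declaration follows; the holder class name and the string value are placeholders, not taken from the patch.

```java
// Hypothetical holder class, used only to make the declaration compile on its own.
class RecoveryConstantsSketch {
  // Placeholder value; the real directory name is whatever the patch defines.
  public static final String RECOVERY_FATAL_OCCURRED_DIR = "fatal-error-flag";
}
```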

In RecoveryService
{code}
          Path fatalErrorDir = new Path(recoveryPath,
              RECOVERY_FATAL_OCCURRED_DIR);
          try {
            recoveryDirFS.mkdirs(fatalErrorDir);
          } catch (IOException e) {
            LOG.error("Failed to create fatal error flag dir "
                + fatalErrorDir.toString(), e);
          }
          throw ioe;
{code}
Should the IOException be thrown only if creating RECOVERY_FATAL_OCCURRED_DIR
itself fails? Otherwise, just log a warning that no subsequent events will be
retried. Also, don't attempt to write any additional recovery information at
this point; just allow the current DAG to run through.
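
Roughly what I have in mind, as a sketch only. It uses java.nio.file in place of the Hadoop FileSystem calls so it stands alone; the class name, the method name onCriticalWriteFailure, the recoveryDisabled flag and the directory name are all hypothetical and not part of the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

// Hypothetical stand-in for the recovery writer. java.nio.file replaces the
// Hadoop FileSystem API here so that the sketch is self-contained.
class RecoveryFailureSketch {
  private static final Logger LOG = Logger.getLogger("RecoveryFailureSketch");

  // Placeholder value; the real directory name is whatever the patch defines.
  static final String RECOVERY_FATAL_OCCURRED_DIR = "fatal-error-flag";

  private final Path recoveryPath;
  // Once set, no further recovery information should be written.
  private boolean recoveryDisabled = false;

  RecoveryFailureSketch(Path recoveryPath) {
    this.recoveryPath = recoveryPath;
  }

  boolean isRecoveryDisabled() {
    return recoveryDisabled;
  }

  // Called when writing a critical recovery event has failed with 'cause'.
  void onCriticalWriteFailure(IOException cause) throws IOException {
    // Stop writing recovery data regardless of what happens next.
    recoveryDisabled = true;
    Path fatalErrorDir = recoveryPath.resolve(RECOVERY_FATAL_OCCURRED_DIR);
    try {
      Files.createDirectories(fatalErrorDir);
    } catch (IOException e) {
      // The flag could not be persisted, so later attempts cannot be told to
      // abort. Only in this case is the original failure propagated.
      cause.addSuppressed(e);
      throw cause;
    }
    // The flag is in place: warn and let the current DAG run through.
    LOG.warning("Recovery write failed; flag created at " + fatalErrorDir
        + ". Recovery is disabled for the rest of this attempt.");
  }
}
```

With this shape the current DAG is only failed when the flag directory cannot be created; in every other case the failure is recorded once and the attempt continues without recovery.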

Looking at other invocations of handleCriticalEvent: should a
VertexCommitSummaryEvent failure also put the DAG into an ERROR state and
avoid retries?

Are the OrderedWordCount changes related? In particular, the log about
'retainStagingDir'.





> Abort additional attempts if recovery fails.
> --------------------------------------------
>
>                 Key: TEZ-973
>                 URL: https://issues.apache.org/jira/browse/TEZ-973
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>         Attachments: TEZ-973.1.patch
>
>




