[ https://issues.apache.org/jira/browse/TEZ-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13946100#comment-13946100 ]
Siddharth Seth commented on TEZ-973:
------------------------------------

{code}
if (foundPreviousAttempt == -1) {
  LOG.info("Falling back to first attempt as no other recovered attempts"
      + " found");
  foundPreviousAttempt = 1;
}
{code}
Is this valid? This directory would already have been checked during iteration. Is this the case where an AM ran without actually executing any DAGs?

{code}
.addTransition(DAGState.NEW, DAGState.KILLED,
    DAGEventType.DAG_KILL,
    new KillNewJobTransition())
{code}
Does this need to have ERROR as a potential state? Similarly, DAG_KILL on INITED.

Also at
{code}
.addTransition(DAGState.RUNNING,
    EnumSet.of(DAGState.RUNNING, DAGState.SUCCEEDED,
        DAGState.TERMINATING, DAGState.FAILED),
    DAGEventType.DAG_VERTEX_COMPLETED,
{code}
Should the Vertex be put into ERROR state if a VERTEX_FINISHED persist fails?

{code}
public static String RECOVERY_FATAL_OCCURRED_DIR
{code}
Should be final.

In RecoveryService
{code}
Path fatalErrorDir = new Path(recoveryPath, RECOVERY_FATAL_OCCURRED_DIR);
try {
  recoveryDirFS.mkdirs(fatalErrorDir);
} catch (IOException e) {
  LOG.error("Failed to create fatal error flag dir " + fatalErrorDir.toString(), e);
}
throw ioe;
{code}
Should the IOException only be thrown if there's a failure to create the RECOVERY_FATAL_OCCURRED_DIR, and otherwise just log a warning that no subsequent events will be retried? Also, don't attempt writing any additional recovery information at this point and just allow the current DAG to run through.

Looking at other invocations of handleCriticalEvent: should a VertexCommitSummaryEvent failure also put the DAG into an ERROR state, and avoid retries?

Are the OrderedWordCount changes related, especially the log about 'retainStagingDir'?

> Abort additional attempts if recovery fails.
> --------------------------------------------
>
>                 Key: TEZ-973
>                 URL: https://issues.apache.org/jira/browse/TEZ-973
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>         Attachments: TEZ-973.1.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)
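
For illustration only, the RecoveryService suggestion in the comment could look roughly like the sketch below. It is not the Tez code: `RecoveryFlagSketch`, `onCriticalWriteFailure`, `isRecoveryDisabled`, and the flag directory value are hypothetical names, and `java.nio.file` stands in for the Hadoop `FileSystem` so the example runs on its own. The idea shown is to stop writing recovery data, try to leave the fatal-error flag for later attempts, and propagate an IOException only if the flag itself cannot be created.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

/** Hypothetical stand-in for the recovery writer; not the actual Tez RecoveryService. */
class RecoveryFlagSketch {
    private static final Logger LOG = Logger.getLogger("RecoveryFlagSketch");

    // Declared final, as the review requests. The value is a placeholder assumption.
    static final String RECOVERY_FATAL_OCCURRED_DIR = "fatal_error_flag";

    private final Path recoveryPath;
    private boolean recoveryDisabled = false;

    RecoveryFlagSketch(Path recoveryPath) {
        this.recoveryPath = recoveryPath;
    }

    /** True once a critical write has failed; callers should skip further recovery writes. */
    boolean isRecoveryDisabled() {
        return recoveryDisabled;
    }

    /**
     * Called after persisting a critical recovery event failed with {@code cause}.
     * Throws only when the flag directory cannot be created.
     */
    void onCriticalWriteFailure(IOException cause) throws IOException {
        // Stop writing recovery information regardless of what happens next.
        recoveryDisabled = true;

        Path fatalErrorDir = recoveryPath.resolve(RECOVERY_FATAL_OCCURRED_DIR);
        try {
            Files.createDirectories(fatalErrorDir);
        } catch (IOException flagFailure) {
            // Without the flag, a later attempt cannot tell that the recovery
            // data is incomplete, so this is the case that is propagated.
            flagFailure.addSuppressed(cause);
            throw flagFailure;
        }

        // Flag written: later attempts can see it, so the current DAG may run through.
        LOG.warning("Recovery write failed (" + cause.getMessage()
            + "); flag created at " + fatalErrorDir
            + ", no further recovery events will be written");
    }
}
```

In this sketch the caller would check `isRecoveryDisabled()` before each subsequent write and skip it, which corresponds to letting the current DAG run through without further recovery information.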