[ 
https://issues.apache.org/jira/browse/TEZ-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298403#comment-14298403
 ] 

Bikas Saha commented on TEZ-1895:
---------------------------------

Was this added because the new tests revealed a bug where the vertex would
not complete because of the counting error? Maybe add the diagnostic when the
failure is triggered in vertexReRunning() rather than inside
checkForCompletion()?
{code}
+      if (!failed) {
+        job.numCompletedVertices--;
+      }
{code}

Rename INVALID_RERUN -> VERTEX_RERUN_AFTER_COMMIT?

The diagnostic is less informative than the log. Can we get the vertex 
information in the diagnostic? 
{code}
+      if(dag.terminationCause == DAGTerminationCause.INVALID_RERUN ){
+        String diagnosticMsg = "DAG failed due to invalid rerun." +
+            " failedVertices:" + dag.numFailedVertices +
+            " killedVertices:" + dag.numKilledVertices;
+        LOG.info(diagnosticMsg);
{code}
{code}
             LOG.info("Aborting job as committed vertex: "
                 + vertex.getLogIdentifier() + " is re-running");
{code}
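
For illustration, a diagnostic that carries the vertex information could be built along the lines below. This is only a sketch under stated assumptions: RerunDiagnostics and build are hypothetical names that do not come from the patch, and the vertexLogIdentifier parameter stands in for whatever vertex.getLogIdentifier() returns at the point where the rerun is detected.

```java
// Hypothetical helper for illustration only; not part of the Tez code base.
final class RerunDiagnostics {
    private RerunDiagnostics() {
    }

    /**
     * Builds a diagnostic message that names the re-running vertex.
     *
     * @param vertexLogIdentifier assumed to be the value of vertex.getLogIdentifier()
     * @param numFailedVertices   failed-vertex count, as in the patch's message
     * @param numKilledVertices   killed-vertex count, as in the patch's message
     */
    static String build(String vertexLogIdentifier,
                        int numFailedVertices,
                        int numKilledVertices) {
        return "DAG failed because committed vertex " + vertexLogIdentifier
            + " is re-running."
            + " failedVertices:" + numFailedVertices
            + " killedVertices:" + numKilledVertices;
    }
}
```

Building the string once where the vertex is known would let the same text be used for both the log line and the diagnostic, so the two cannot drift apart.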

Perhaps in a separate jira we should rename TaskAttemptTerminationCause to
FailureReason and consolidate DAGTerminationCause and VertexTerminationCause
into it. Currently there is too much duplication, and essentially all we need
is a single programmatic enum for a common set of failure reasons.
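
As a rough sketch of what such a consolidation might look like: one enum of reasons shared by every level, with the level recorded separately. Everything here is an assumption for illustration. FailureScope and FailureInfo are hypothetical names, and the constants other than VERTEX_RERUN_AFTER_COMMIT are placeholders, not a proposed final set.

```java
// Hypothetical sketch; these types do not exist in Tez under these names.
enum FailureReason {
    VERTEX_RERUN_AFTER_COMMIT, // name suggested earlier in this comment
    INTERNAL_ERROR,            // illustrative placeholder
    OTHER                      // illustrative placeholder
}

// The level at which a failure is reported (hypothetical).
enum FailureScope {
    DAG, VERTEX, TASK_ATTEMPT
}

// Pairs the shared reason with the level it applies to and a free-form diagnostic.
final class FailureInfo {
    final FailureScope scope;
    final FailureReason reason;
    final String diagnostics;

    FailureInfo(FailureScope scope, FailureReason reason, String diagnostics) {
        this.scope = scope;
        this.reason = reason;
        this.diagnostics = diagnostics;
    }

    @Override
    public String toString() {
        return scope + " failed with " + reason + ": " + diagnostics;
    }
}
```

With this shape, a DAG-level and a vertex-level failure would report the same FailureReason constant instead of two parallel enum values that have to be kept in sync.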

> Vertex reRunning should decrease successfulMembers of VertexGroupInfo
> ---------------------------------------------------------------------
>
>                 Key: TEZ-1895
>                 URL: https://issues.apache.org/jira/browse/TEZ-1895
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1895-1.patch, TEZ-1895-2.patch, TEZ-1895-3.patch
>
>
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo,
> otherwise the commit may happen while the vertex is still re-running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
