[
https://issues.apache.org/jira/browse/TEZ-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298403#comment-14298403
]
Bikas Saha commented on TEZ-1895:
---------------------------------
Was this added because the new tests revealed a bug where the vertex would not
complete because of the counting error? Maybe add the diagnostic where the
failure is triggered, in vertexReRunning(), rather than inside
checkForCompletion()?
{code}
+ if (!failed) {
+   job.numCompletedVertices--;
+ }
{code}
Rename INVALID_RERUN -> VERTEX_RERUN_AFTER_COMMIT?
The diagnostic is less informative than the log. Can we get the vertex
information in the diagnostic?
{code}
+ if(dag.terminationCause == DAGTerminationCause.INVALID_RERUN ){
+ String diagnosticMsg = "DAG failed due to invalid rerun." +
+ " failedVertices:" + dag.numFailedVertices +
+ " killedVertices:" + dag.numKilledVertices;
+ LOG.info(diagnosticMsg);
{code}
{code}
LOG.info("Aborting job as committed vertex: "
    + vertex.getLogIdentifier() + " is re-running");
{code}
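As a rough illustration of what I mean by getting the vertex information into the diagnostic, something along these lines might work. This is only a sketch: the class RerunDiagnostic and its build() method are hypothetical names made up for this comment, not existing Tez code, and the message wording is just a suggestion.

```java
// Hypothetical sketch: RerunDiagnostic and build() are illustrative names,
// not part of the Tez code base.
final class RerunDiagnostic {
    private RerunDiagnostic() {}

    /**
     * Builds a diagnostic that names the vertex whose rerun triggered the
     * failure, in addition to the failed/killed counts already reported.
     */
    static String build(String vertexLogIdentifier,
                        int numFailedVertices,
                        int numKilledVertices) {
        return "DAG failed because committed vertex " + vertexLogIdentifier
            + " is re-running."
            + " failedVertices:" + numFailedVertices
            + " killedVertices:" + numKilledVertices;
    }

    public static void main(String[] args) {
        // In the patch, the first argument would presumably come from
        // vertex.getLogIdentifier(); a literal is used here for illustration.
        System.out.println(build("vertex_1 [v1]", 0, 2));
    }
}
```

Building the message at the point where the rerun is detected would let the same string be used for both the log line and the diagnostic.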
Perhaps in a separate jira we should rename TaskAttemptTerminationCause to
FailureReason and consolidate DAGTerminationCause and VertexTerminationCause
into it. Currently there is too much duplication, and essentially all we need
is a programmatic enum for a common set of failure reasons.
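To make the consolidation idea a bit more concrete, here is a minimal sketch of one shared enum used at every level. Everything in it is an assumption for illustration: FailureRecord, Scope, and the constants other than VERTEX_RERUN_AFTER_COMMIT are hypothetical placeholders, not a proposal for the actual set of reasons.

```java
// Hypothetical sketch of a single shared failure-reason enum.
// Only VERTEX_RERUN_AFTER_COMMIT comes from the discussion above;
// the other constants are placeholders.
enum FailureReason {
    VERTEX_RERUN_AFTER_COMMIT, // suggested rename of INVALID_RERUN
    INTERNAL_ERROR,            // placeholder
    USER_KILL                  // placeholder
}

// Hypothetical holder showing that the level (DAG, vertex, task attempt)
// can be tracked separately from the reason, so one enum serves all three.
final class FailureRecord {
    enum Scope { DAG, VERTEX, TASK_ATTEMPT }

    final Scope scope;
    final FailureReason reason;

    FailureRecord(Scope scope, FailureReason reason) {
        this.scope = scope;
        this.reason = reason;
    }

    /** Returns a short description combining the scope and the shared reason. */
    String describe() {
        return scope + " terminated: " + reason;
    }
}
```

With something like this, the DAG, vertex, and task attempt code paths would all report the same programmatic reason instead of mapping between three parallel enums.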
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo
> ---------------------------------------------------------------------
>
> Key: TEZ-1895
> URL: https://issues.apache.org/jira/browse/TEZ-1895
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1895-1.patch, TEZ-1895-2.patch, TEZ-1895-3.patch
>
>
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo,
> otherwise commit may happen while the vertex is still re-running.