[
https://issues.apache.org/jira/browse/TEZ-1895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298522#comment-14298522
]
Jeff Zhang commented on TEZ-1895:
---------------------------------
bq. Was this added because the new tests revealed the bug that vertex would not
complete because of the counting error?
Yes. Otherwise, the DAG won't finish.
bq. INVALID_RERUN -> VERTEX_RERUN_AFTER_COMMIT???
Done
bq. Maybe add the diagnostic when the failure is triggered in
vertexReRunning() rather than inside checkForCompletion()?
bq. The diagnostic is less informative than the log. Can we get the vertex
information in the diagnostic?
I suppose these two points are the same thing: add the diagnostics in
vertexReRunning().
bq. Perhaps in a separate jira we should rename TaskAttemptTerminationCause to
FailureReason and consolidate DAGTerminationCause and VertexTerminationCause
into it. Currently there is too much duplication and essentially we are only
looking for a programmatic enum for a common set of failure reasons.
I have an impression there may already be a jira for consolidating the
termination causes, but I don't remember the jira number.
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo
> ---------------------------------------------------------------------
>
> Key: TEZ-1895
> URL: https://issues.apache.org/jira/browse/TEZ-1895
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-1895-1.patch, TEZ-1895-2.patch, TEZ-1895-3.patch,
> TEZ-1895-4.patch
>
>
> Vertex reRunning should decrease successfulMembers of VertexGroupInfo;
> otherwise, the commit may happen while a vertex is still rerunning.
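
The counting problem described above can be illustrated with a minimal sketch. This is not the actual Tez code: the class VertexGroupSketch, its methods, and the choice to reject a rerun that arrives after the commit are all assumptions made for illustration. Only the names successfulMembers and vertexReRunning are borrowed from the issue text.

```java
// Hypothetical stand-in for the vertex group bookkeeping; not the real Tez class.
class VertexGroupSketch {
    private final int groupSize;        // number of vertices in the group
    private int successfulMembers = 0;  // vertices currently in a succeeded state
    private boolean committed = false;

    VertexGroupSketch(int groupSize) {
        this.groupSize = groupSize;
    }

    int successfulMembers() {
        return successfulMembers;
    }

    // Called when a member vertex succeeds.
    void vertexSucceeded() {
        successfulMembers++;
    }

    // Called when a previously succeeded member vertex starts rerunning.
    // Returns false if the group has already committed (assumed here to be
    // the situation a cause like VERTEX_RERUN_AFTER_COMMIT would report).
    boolean vertexReRunning() {
        if (committed) {
            return false;
        }
        // The point of the fix: undo the earlier success so the group
        // is no longer considered ready to commit.
        successfulMembers--;
        return true;
    }

    boolean readyToCommit() {
        return !committed && successfulMembers == groupSize;
    }

    void commit() {
        if (!readyToCommit()) {
            throw new IllegalStateException("group is not ready to commit");
        }
        committed = true;
    }

    public static void main(String[] args) {
        VertexGroupSketch group = new VertexGroupSketch(2);
        group.vertexSucceeded();
        group.vertexSucceeded();
        System.out.println("ready before rerun: " + group.readyToCommit());
        group.vertexReRunning();
        System.out.println("ready during rerun: " + group.readyToCommit());
    }
}
```

Without the decrement in vertexReRunning(), the sketch would still report the group as ready to commit while one of its members is rerunning, which is the situation the issue describes.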
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)