[
https://issues.apache.org/jira/browse/TEZ-714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14446581#comment-14446581
]
Bikas Saha commented on TEZ-714:
--------------------------------
bq. Both task finishing and commit finishing may be the last step to the
finished state (if no commit then task finishing is the last step, and
committing finish is the last step if there's commit), so they would both
trigger checkForCompletion.
Today, the semantics of checkForCompletion() are actually
checkForTasksCompletion(). It checks if all tasks have succeeded. On success,
it would complete all commits synchronously and return the final state
SUCCEEDED. This patch is overloading checkForCompletion() to do both
checkForTasksCompletion() and checkForCommitsCompletion() resulting in logic
overloading that is error-prone and makes future changes harder because the
method is doing 2 things.
{code}
+ LOG.info("All tasks are succeeded, vertex:" + vertex.logIdentifier);
+ if (!vertex.commitVertexOutputs) {
+ // just finish because no vertex committing needed
+ return vertex.finished(VertexState.SUCCEEDED);
+ } else if (!vertex.committed.getAndSet(true)) {
+ // start commit if there're commits or just finish if no commits
+ return commitOrFinish(vertex);
// This part belongs to checkForCommitsCompletion()
+ } else if (vertex.commitFutures.isEmpty()) {
+ // move from COMMITTING to SUCCEEDED
+ return vertex.finished(VertexState.SUCCEEDED);
+ } else {
+ return VertexState.COMMITTING;
}{code}
bq. What i am doing is not exactly this way. I won't ignore failed commits,
stead I will cancel pending commits and wait for them to complete and then move
to failed state.
Will calling cancel on the future object is going to result in
VertexCommitCallback#onFailure() to be invoked. If not, then the vertex will
hang on waiting for the commitFutures to be empty because no CommitCompleted
event will come.
bq. This behavior is consistent with other cases when there's any fail event
happens (commit fail, vertex termination event and etc ), all the cases would
cancel pending commits and wait for them to complete and then move to finished
state.
Is this existing behavior or behavior in the patch? In either case, this logic
is not consistent from a users point of view. Depending on which async commit
operation ran first on the threadpool and failed, the user will see anywhere
between 1 to N-1 committed outputs. Is that observation correct? If yes, is
that a better choice than saying - User will see either no outputs or all
successfully committed outputs?
> OutputCommitters should not run in the main AM dispatcher thread
> ----------------------------------------------------------------
>
> Key: TEZ-714
> URL: https://issues.apache.org/jira/browse/TEZ-714
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Jeff Zhang
> Priority: Critical
> Attachments: DAG_2.pdf, TEZ-714-1.patch, TEZ-714-2.patch,
> TEZ-714-3.patch, TEZ-714-4.patch, TEZ-714-5.patch, TEZ-714-6.patch,
> TEZ-714-7.patch, Vertex_2.pdf
>
>
> Follow up jira from TEZ-41.
> 1) If there's multiple OutputCommitters on a Vertex, they can be run in
> parallel.
> 2) Running an OutputCommitter in the main thread blocks all other event
> handling, w.r.t the DAG, and causes the event queue to back up.
> 3) This should also cover shared commits that happen in the DAG.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)