[
https://issues.apache.org/jira/browse/TEZ-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211885#comment-14211885
]
Jeff Zhang commented on TEZ-1744:
---------------------------------
[~hitesh] I copied the comments about commit from TEZ-1737 into this jira's
description.
Based on your explanation, my understanding is this: when the summary
VertexFinishedEvent is seen but the non-summary VertexFinishedEvent is not,
and the vertex has a committer, we cannot recover the vertex to SUCCEEDED,
because data movement events may have been lost. But can we recover it to
RUNNING and recover its tasks? That way the data movement events would be
regenerated. We would also remember that the commit is already done and not
do it again. I notice the DAG does it that way in the following code.
{code}
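for (VertexGroupInfo groupInfo : commitList) {
  // Skip groups that the recovery data already records as committed.
  if (recoveredGroupCommits.containsKey(groupInfo.groupName)) {
    LOG.info("VertexGroup was already committed as per recovery"
        + " data, groupName=" + groupInfo.groupName);
    continue;
  }
  // ... (rest of the loop body is not quoted here)
}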
{code}
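To make the proposal concrete, here is a minimal sketch of the "remember the commit and never repeat it" guard. It is not Tez code: the class CommitOnceSketch and its methods are hypothetical names invented for illustration, and a plain counter stands in for the real committer call.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only: a guard that lets a vertex be recovered to RUNNING
 * (so its tasks re-run and regenerate data movement events) while making
 * sure a commit that recovery data already records is never repeated.
 * All names here are hypothetical, not Tez APIs.
 */
class CommitOnceSketch {
    // Vertices whose commit is known to be done (recovered or done now).
    private final Set<String> committed = new HashSet<>();
    // Counts how many times the (stand-in) committer was really invoked.
    private int commitCalls = 0;

    /** Record a commit that the recovery log says already happened. */
    void markRecoveredCommit(String vertexName) {
        committed.add(vertexName);
    }

    /**
     * Invoke the stand-in committer unless the commit is already recorded.
     * Returns true only if the committer was invoked by this call.
     */
    boolean commitIfNeeded(String vertexName) {
        if (committed.contains(vertexName)) {
            return false; // already committed: a committer must not run twice
        }
        commitCalls++;    // stand-in for the real commit work
        committed.add(vertexName);
        return true;
    }

    int commitCalls() {
        return commitCalls;
    }
}
```

With this guard, a vertex marked via markRecoveredCommit can go through its RUNNING lifecycle again, and the later call to commitIfNeeded becomes a no-op for it.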
Besides the commit handling in Vertex recovery, I also found commit handling
in DAG recovery (shown in the following code). Is there any special reason
for it? IMO, its only purpose should be to remember whether the commit is
done; we don't need to check whether a commit is in progress, because that is
already checked via the summary log. I just want to confirm with you before
starting on this jira.
{code}
boolean groupCommitInProgress = false;
if (!dag.recoveredGroupCommits.isEmpty()) {
  for (Entry<String, Boolean> entry :
      dag.recoveredGroupCommits.entrySet()) {
    if (!entry.getValue().booleanValue()) {
      LOG.info("Found a pending Vertex Group commit"
          + ", vertexGroup=" + entry.getKey());
      groupCommitInProgress = true;
      break;
    }
  }
}
{code}
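For comparison, this is roughly how a summary-log scan can detect an in-progress commit on its own: a commit is pending when a "started" record has no matching "finished" record. This is a sketch under assumptions, not the RecoveryParser implementation; SummaryEvent, Kind and pendingCommits are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative only: finds commits that were started but never finished
 * by replaying summary records in order. Names are hypothetical.
 */
class SummaryCommitScanSketch {
    enum Kind { COMMIT_STARTED, COMMIT_FINISHED }

    /** A simplified stand-in for one summary log record. */
    static final class SummaryEvent {
        final Kind kind;
        final String groupName;

        SummaryEvent(Kind kind, String groupName) {
            this.kind = kind;
            this.groupName = groupName;
        }
    }

    /** Returns the groups whose commit started but did not finish. */
    static Set<String> pendingCommits(List<SummaryEvent> events) {
        Set<String> pending = new LinkedHashSet<>();
        for (SummaryEvent e : events) {
            if (e.kind == Kind.COMMIT_STARTED) {
                pending.add(e.groupName);    // commit now in progress
            } else {
                pending.remove(e.groupName); // commit completed
            }
        }
        return pending;
    }
}
```

If a scan like this already runs while parsing the summary log, a non-empty result identifies the unrecoverable case before any recovery transition runs.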
> It is not necessary to check whether dag is commit in RecoveryTransition
> ------------------------------------------------------------------------
>
> Key: TEZ-1744
> URL: https://issues.apache.org/jira/browse/TEZ-1744
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.1
> Reporter: Jeff Zhang
>
> It is not necessary to check whether a DAG commit is in progress in
> RecoveryTransition, because we already check that in RecoveryParser by
> using the summary event.
> Copying the comments from TEZ-1737:
> bq. But even if the non-summary VertexFinishedEvent is seen, its
> VertexRecoverableEventsGeneratedEvent may still be lost. I think there's no
> guarantee that VertexRecoverableEventsGeneratedEvent is logged before
> VertexFinishedEvent.
> The expectation was that all tasks are completed before a vertex has
> finished. Also, a TaskFinishedEvent is only seen after all its data movement
> events are generated and therefore logged.
> The handling is for the general case where a lot of data movement events
> are generated, and commit started and then ended. In a scenario where commit
> starts but does not end, the summary log helps catch the problem. Now, in a
> scenario where commit finished successfully, there could be a situation
> where the AM crashed before all data movement events were stored to
> recovery. In this scenario, we cannot do anything, as the commit has already
> been done but we have no idea what was lost. The crux of the answer to your
> question is that a committer cannot be invoked twice.
> Agree that VertexRecoverableEventsGeneratedEvent is a different problem. In
> such cases, I believe that if VertexRecoverableEventsGeneratedEvent is not
> seen before a VertexFinished is seen, there needs to be some additional
> handling for that scenario too. If a VertexRecoverableEventsGeneratedEvent is
> always guaranteed to be generated for a vertex and it is not seen, then that
> means it is a potentially non-recoverable case when the vertex itself was
> seen to have been completed.