[ 
https://issues.apache.org/jira/browse/TEZ-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212422#comment-14212422
 ] 

Hitesh Shah commented on TEZ-1744:
----------------------------------

The commit logic within the DAG (for end-of-DAG commits or VertexGroups) is 
slightly different, as these commits are only invoked after a vertex or a group 
of vertices has finished. Therefore, all task completions and events are 
already guaranteed to be available if a full VertexFinishedEvent is seen.

In the above case, the only thing to worry about is whether there was a commit 
in progress. This is also simpler in the case where all committers are invoked 
at the end of the DAG.
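
As a rough illustration of the "commit in progress" check, here is a minimal 
sketch. The SummaryEvent type, the Kind enum and the commitsInProgress method 
are hypothetical names made up for this example and are not Tez's actual 
recovery classes. It assumes the summary log can be replayed in order and that 
each commit has a start record and a finish record.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a replayed summary log; not the real Tez classes.
final class CommitInProgressCheck {
    enum Kind { COMMIT_STARTED, COMMIT_FINISHED }

    static final class SummaryEvent {
        final Kind kind;
        final String target; // e.g. a DAG id or a vertex group name (assumed)

        SummaryEvent(Kind kind, String target) {
            this.kind = kind;
            this.target = target;
        }
    }

    /** Returns every target whose commit started but was never seen to finish. */
    static Set<String> commitsInProgress(List<SummaryEvent> summaryLog) {
        Set<String> open = new HashSet<>();
        for (SummaryEvent event : summaryLog) {
            if (event.kind == Kind.COMMIT_STARTED) {
                open.add(event.target);
            } else {
                open.remove(event.target);
            }
        }
        return open;
    }
}
```

A non-empty result would correspond to the case where a commit was in progress 
when the AM went down.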

However, there may be a bug in the case of a VertexGroup that is committed 
"immediately" ( not at end of DAG ) - in this scenario, if the vertex group 
commit has been seen to have completed but the respective vertex completion is 
not seen, that would be a problem. 
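
To make the suspected problem concrete, the following sketch flags vertex 
groups whose commit was seen to complete while at least one member vertex has 
no finished event in the recovered log. All names here (VertexGroupCommitCheck, 
findProblemGroups and the three inputs) are assumptions for illustration only; 
the real recovery code keeps this state differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical view of what was replayed from the recovery log.
final class VertexGroupCommitCheck {
    /**
     * Returns, in sorted order, the committed groups that have at least one
     * member vertex without a finished event.
     */
    static List<String> findProblemGroups(
            Map<String, Set<String>> groupMembers,
            Set<String> groupsWithCommitFinished,
            Set<String> verticesWithFinishedEvent) {
        List<String> problems = new ArrayList<>();
        for (String group : new TreeSet<>(groupsWithCommitFinished)) {
            Set<String> members = groupMembers.get(group);
            if (members == null) {
                members = Collections.emptySet();
            }
            // The commit completed, but a member vertex completion was not seen.
            if (!verticesWithFinishedEvent.containsAll(members)) {
                problems.add(group);
            }
        }
        return problems;
    }
}
```

Groups returned by this check are the ones that would need extra handling 
during recovery.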

bq. But can we recover it to RUNNING and recover its tasks , in this way the 
data movement events will be regenerated and also remember the recovery is done 
and don't do it again. 

The problem with the above is that when tasks re-run, there is no guarantee 
that they will regenerate the exact same data. Also, in this case, we need to 
be able to invoke the committer to abort the data from the tasks that have been 
re-run. In some cases, for example when a committer is committing a transaction 
to a DB, the commit cannot be recovered at all. 

> It is not necessary to check whether dag is commit in RecoveryTransition
> ------------------------------------------------------------------------
>
>                 Key: TEZ-1744
>                 URL: https://issues.apache.org/jira/browse/TEZ-1744
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.1
>            Reporter: Jeff Zhang
>
> It is not necessary to check whether dag is commit in RecoveryTransition, 
> because we already check that in RecoveryParser by using the summary event.
> Copy the comments from TEZ-1737,
> bq. But even if the non-summary VertexFinishedEvent is seen, its 
> VertexRecoverableEventsGeneratedEvent may still be lost. I think there is no 
> guarantee that VertexRecoverableEventsGeneratedEvent is logged before 
> VertexFinishedEvent.
> The expectation was that all tasks are completed before a vertex has 
> finished. Also, a TaskFinishedEvent is only seen after all its datamovement 
> events are generated and therefore logged.
> The handling is for the general case, where a lot of data movement events 
> are generated and the commit starts and then ends. In a scenario, where commit 
> starts but does not end, the summary log helps catch the problem. Now, in a 
> scenario, where commit finished successfully, there could be a situation 
> where the AM crashed before all data movements are stored to recovery. In 
> this scenario, we cannot do anything as the commit has already been done but 
> we have no idea what was lost. The main crux to answer your question is that 
> a committer cannot be invoked twice.
> Agree that VertexRecoverableEventsGeneratedEvent is a different problem. In 
> such cases, I believe that if VertexRecoverableEventsGeneratedEvent is not 
> seen before a VertexFinished is seen, there needs to be some additional 
> handling for that scenario too. If a VertexRecoverableEventsGeneratedEvent is 
> always guaranteed to be generated for a vertex and it is not seen, then that 
> means it is a potential non-recoverable case when the vertex itself was seen 
> to have been completed.


