[ 
https://issues.apache.org/jira/browse/TEZ-1744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated TEZ-1744:
----------------------------
    Description: 
It is not necessary to check whether the DAG is committed in RecoveryTransition, 
because we already check that in RecoveryParser by using the summary event.

Copying the comments from TEZ-1737:

bq. But even if the non-summary VertexFinishedEvent is seen, its 
VertexRecoverableEventsGeneratedEvent may still be lost. I think there's no 
guarantee that VertexRecoverableEventsGeneratedEvent is logged before 
VertexFinishedEvent.
The expectation was that all tasks are completed before a vertex has finished. 
Also, a TaskFinishedEvent is only seen after all its data movement events are 
generated and therefore logged.
The handling is for the general case, where many data movement events are 
generated and commit starts and then ends. In a scenario where commit starts 
but does not end, the summary log helps catch the problem. In a scenario where 
commit finished successfully, the AM could still have crashed before all data 
movement events were stored to recovery. In that case we cannot do anything, 
because the commit has already been done but we have no idea what was lost. 
The crux of the answer to your question is that a committer cannot be invoked 
twice.
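
As a rough illustration of the commit handling described above, the commit markers in the summary log could be classified as in the sketch below. This is only a sketch: CommitSummaryCheck, its enums, and the decision names are hypothetical and are not Tez classes, and the real check in RecoveryParser may look quite different.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are not Tez classes.
final class CommitSummaryCheck {
    // Simplified stand-ins for the commit markers recorded in the summary log.
    enum SummaryEvent { COMMIT_STARTED, COMMIT_FINISHED }

    enum Decision { NO_COMMIT_SEEN, COMMIT_INCOMPLETE, COMMIT_DONE }

    static Decision decide(List<SummaryEvent> summaryLog) {
        boolean started = summaryLog.contains(SummaryEvent.COMMIT_STARTED);
        boolean finished = summaryLog.contains(SummaryEvent.COMMIT_FINISHED);
        if (finished) {
            // Commit already happened; the committer must not be invoked again.
            return Decision.COMMIT_DONE;
        }
        if (started) {
            // Commit started but never ended: the summary log catches this.
            return Decision.COMMIT_INCOMPLETE;
        }
        return Decision.NO_COMMIT_SEEN;
    }

    public static void main(String[] args) {
        System.out.println(decide(Arrays.asList(SummaryEvent.COMMIT_STARTED)));
        System.out.println(decide(Arrays.asList(
                SummaryEvent.COMMIT_STARTED, SummaryEvent.COMMIT_FINISHED)));
        System.out.println(decide(Collections.<SummaryEvent>emptyList()));
    }
}
```

Because the decision depends only on the summary markers, it can be made once while parsing, which is the reason a second check in RecoveryTransition looks redundant.
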
Agreed that VertexRecoverableEventsGeneratedEvent is a different problem. In 
such cases, I believe that if VertexRecoverableEventsGeneratedEvent is not seen 
before a VertexFinishedEvent is seen, there needs to be some additional 
handling for that scenario too. If a VertexRecoverableEventsGeneratedEvent is 
guaranteed to be generated for every vertex and it is not seen, then the case 
is potentially non-recoverable when the vertex itself was seen to have 
completed.
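
To make the ordering concern concrete, here is a minimal sketch of the additional handling suggested above: it flags vertices whose finished event appears in the recovery log before any recoverable-events-generated event. The class VertexEventOrderCheck, its Event type, and the enum constants are hypothetical stand-ins, not Tez APIs, and the sketch assumes every vertex is expected to log such an event.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: these types are illustrative and are not Tez classes.
final class VertexEventOrderCheck {
    enum EventType { VERTEX_RECOVERABLE_EVENTS_GENERATED, VERTEX_FINISHED }

    static final class Event {
        final String vertexId;
        final EventType type;

        Event(String vertexId, EventType type) {
            this.vertexId = vertexId;
            this.type = type;
        }
    }

    // Returns the vertices whose finished event was seen before any
    // recoverable-events-generated event for the same vertex.
    static Set<String> suspectVertices(List<Event> recoveryLog) {
        Set<String> generated = new HashSet<>();
        Set<String> suspects = new LinkedHashSet<>();
        for (Event e : recoveryLog) {
            if (e.type == EventType.VERTEX_RECOVERABLE_EVENTS_GENERATED) {
                generated.add(e.vertexId);
            } else if (e.type == EventType.VERTEX_FINISHED
                    && !generated.contains(e.vertexId)) {
                suspects.add(e.vertexId);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        List<Event> log = Arrays.asList(
                new Event("v1", EventType.VERTEX_RECOVERABLE_EVENTS_GENERATED),
                new Event("v1", EventType.VERTEX_FINISHED),
                new Event("v2", EventType.VERTEX_FINISHED));
        System.out.println(suspectVertices(log));
    }
}
```
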

  was: It is not necessary to check whether the DAG is committed in RecoveryTransition, 
because we already check that in RecoveryParser by using the summary event.


> It is not necessary to check whether dag is commit in RecoveryTransition
> ------------------------------------------------------------------------
>
>                 Key: TEZ-1744
>                 URL: https://issues.apache.org/jira/browse/TEZ-1744
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.1
>            Reporter: Jeff Zhang
>
> It is not necessary to check whether the DAG is committed in 
> RecoveryTransition, because we already check that in RecoveryParser by using 
> the summary event.
> Copying the comments from TEZ-1737:
> bq. But even if the non-summary VertexFinishedEvent is seen, its 
> VertexRecoverableEventsGeneratedEvent may still be lost. I think there's no 
> guarantee that VertexRecoverableEventsGeneratedEvent is logged before 
> VertexFinishedEvent.
> The expectation was that all tasks are completed before a vertex has 
> finished. Also, a TaskFinishedEvent is only seen after all its data movement 
> events are generated and therefore logged.
> The handling is for the general case, where many data movement events are 
> generated and commit starts and then ends. In a scenario where commit starts 
> but does not end, the summary log helps catch the problem. In a scenario 
> where commit finished successfully, the AM could still have crashed before 
> all data movement events were stored to recovery. In that case we cannot do 
> anything, because the commit has already been done but we have no idea what 
> was lost. The crux of the answer to your question is that a committer cannot 
> be invoked twice.
> Agreed that VertexRecoverableEventsGeneratedEvent is a different problem. In 
> such cases, I believe that if VertexRecoverableEventsGeneratedEvent is not 
> seen before a VertexFinishedEvent is seen, there needs to be some additional 
> handling for that scenario too. If a VertexRecoverableEventsGeneratedEvent is 
> guaranteed to be generated for every vertex and it is not seen, then the case 
> is potentially non-recoverable when the vertex itself was seen to have 
> completed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
