[ 
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114081#comment-14114081
 ] 

Bikas Saha commented on TEZ-1345:
---------------------------------

There are 2 alternatives:
1) pessimistic - save events before starting. This delays the start and so 
hurts performance. This patch is not really achieving that. 
2) optimistic - save events while starting. The only case where this won't 
work is when the AM crashes immediately after.
In both cases, for now, the contract for init events is that they must be made 
up-front. So it's a one-time thing. When that changes, there will need to be an 
additional mechanism to notify the framework that initing is done. And in fact 
it may not be done until the very end of execution, when the last block of 
data gets assigned to an owner. How recovery is going to work in these cases 
is still not clear, though the optimistic approach still works where it works.
IMO the performance loss is probably not going to be acceptable for short 
queries.
What we could do is add an API that allows the VertexManager to notify the 
framework that it is done making updates. It could also pass along a state 
payload that represents its state in case we need to restart it. That 
notification could be saved in the log. If that notification is present during 
recovery then we can continue to recover from where we left off and also 
provide state to the VM. If that notification is not present in recovery then 
we start from scratch. IMO, in 99% of the cases this should be enough. The 
contract for VMs then clearly becomes: recovery works post DONE notification.
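
A rough sketch of the DONE notification idea above, in Java, just to make the 
contract concrete. Everything here is hypothetical and not an existing Tez 
API: the names RecoverySketch, RecoveryLog, notifyUpdatesDone and recover are 
made up, and the log is an in-memory list standing in for the real recovery 
log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; none of these types exist in Tez.
final class RecoverySketch {

    static final String DONE = "VM_DONE";

    /** One record in the (assumed) recovery log. */
    static final class LogEntry {
        final String vertexName;
        final String type;
        final byte[] payload; // opaque VertexManager state, may be empty

        LogEntry(String vertexName, String type, byte[] payload) {
            this.vertexName = vertexName;
            this.type = type;
            this.payload = payload;
        }
    }

    /** In-memory stand-in for the AM recovery log. */
    static final class RecoveryLog {
        private final List<LogEntry> entries = new ArrayList<>();

        void append(LogEntry entry) {
            entries.add(entry);
        }

        /** Returns the latest DONE notification logged for the vertex, if any. */
        Optional<LogEntry> findDone(String vertexName) {
            Optional<LogEntry> found = Optional.empty();
            for (LogEntry e : entries) {
                if (e.vertexName.equals(vertexName) && e.type.equals(DONE)) {
                    found = Optional.of(e);
                }
            }
            return found;
        }
    }

    /** What the framework decides to do for a vertex during recovery. */
    static final class RecoveryDecision {
        final boolean resume;  // true: continue from where we left off
        final byte[] vmState;  // state handed back to the VertexManager

        RecoveryDecision(boolean resume, byte[] vmState) {
            this.resume = resume;
            this.vmState = vmState;
        }
    }

    /** Called by the VertexManager once it is done making updates. */
    static void notifyUpdatesDone(RecoveryLog log, String vertexName, byte[] statePayload) {
        log.append(new LogEntry(vertexName, DONE, statePayload.clone()));
    }

    /** DONE present: resume with the saved state. DONE absent: start from scratch. */
    static RecoveryDecision recover(RecoveryLog log, String vertexName) {
        return log.findDone(vertexName)
                .map(e -> new RecoveryDecision(true, e.payload))
                .orElseGet(() -> new RecoveryDecision(false, new byte[0]));
    }
}
```

With this shape, a vertex whose VM never logged DONE before the AM crashed 
simply gets resume == false and is re-run from scratch, which matches the 
contract that recovery works post DONE notification.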

> Add checks to guarantee all init events are written to recovery to consider 
> vertex initialized
> ----------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1345
>                 URL: https://issues.apache.org/jira/browse/TEZ-1345
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>         Attachments: Tez-1345-2.patch, Tez-1345.patch
>
>
> Related to issue discovered in TEZ-1033



--
This message was sent by Atlassian JIRA
(v6.2#6252)
