[
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114081#comment-14114081
]
Bikas Saha commented on TEZ-1345:
---------------------------------
There are 2 alternatives:
1) pessimistic - save events before starting. This delays the start and costs
performance. This patch is not really achieving that.
2) optimistic - save events while starting. The only case where this won't work
is when the AM crashes immediately after.
In both cases, for now, the contract for init events is that they must be made
up-front. So it's a one-time thing. When that changes, there will need to be an
additional mechanism to notify the framework that initing is done. And in fact
it may not be done until the very end of execution, when the last block of data
gets assigned to an owner. How recovery is going to work in these cases is
still not clear, though the optimistic approach still works where it works.
IMO the performance loss is probably not going to be acceptable for short queries.
What we could do is add an API that allows the VertexManager to notify the
framework that it is done making updates. It could also pass along a state
payload that represents its state in case we need to restart it. That
notification could be saved in the log. If that notification is present during
recovery then we can continue to recover from where we left off and also
provide state to the VM. If that notification is not present in recovery then
we start from scratch. IMO, in 99% of the cases this should be enough. The
contract for VMs then clearly becomes: recovery works post DONE notification.
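
To make the proposal concrete, here is a rough sketch of the DONE notification
and the recovery decision. Everything in it is hypothetical: the class and
method names (VertexRecoverySketch, RecoveryLog, notifyDone, recover) and the
"VM_DONE" / "INIT_EVENT" entry kinds are made up for illustration and are not
part of the Tez API or of this patch. The log is an in-memory list standing in
for the real recovery log.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; none of these types exist in Tez.
final class VertexRecoverySketch {

    /** One entry in a simplified, in-memory stand-in for the recovery log. */
    static final class LogEntry {
        final String vertexName;
        final String kind;     // assumed kinds: "INIT_EVENT" or "VM_DONE"
        final byte[] payload;  // opaque; for VM_DONE this is the saved VM state

        LogEntry(String vertexName, String kind, byte[] payload) {
            this.vertexName = vertexName;
            this.kind = kind;
            this.payload = payload;
        }
    }

    static final class RecoveryLog {
        private final List<LogEntry> entries = new ArrayList<>();

        void append(LogEntry entry) {
            entries.add(entry);
        }

        /** Returns the state saved with the DONE notification, if one was logged. */
        Optional<byte[]> findDoneState(String vertexName) {
            for (LogEntry e : entries) {
                if (e.vertexName.equals(vertexName) && e.kind.equals("VM_DONE")) {
                    return Optional.of(e.payload);
                }
            }
            return Optional.empty();
        }
    }

    /** Called by the VertexManager once it is done making updates. */
    static void notifyDone(RecoveryLog log, String vertexName, byte[] vmState) {
        log.append(new LogEntry(vertexName, "VM_DONE", vmState.clone()));
    }

    /** Recovery decision: resume with saved state if DONE was logged, else start over. */
    static String recover(RecoveryLog log, String vertexName) {
        return log.findDoneState(vertexName)
                .map(state -> "RESUME:" + new String(state, StandardCharsets.UTF_8))
                .orElse("RESTART");
    }

    public static void main(String[] args) {
        RecoveryLog log = new RecoveryLog();
        log.append(new LogEntry("v1", "INIT_EVENT", new byte[0]));
        System.out.println(recover(log, "v1")); // no DONE entry yet

        notifyDone(log, "v1", "splits=4".getBytes(StandardCharsets.UTF_8));
        System.out.println(recover(log, "v1")); // DONE entry present
    }
}
```

The point of the sketch is only the contract: init events on their own never
make a vertex recoverable, and the single DONE entry is what recovery keys on.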
> Add checks to guarantee all init events are written to recovery to consider
> vertex initialized
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-1345
> URL: https://issues.apache.org/jira/browse/TEZ-1345
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Attachments: Tez-1345-2.patch, Tez-1345.patch
>
>
> Related to issue discovered in TEZ-1033