[
https://issues.apache.org/jira/browse/TEZ-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120399#comment-14120399
]
Bikas Saha commented on TEZ-1345:
---------------------------------
Here is a summary from an offline discussion with Hitesh.
The root cause of the issue is the inherent race condition in the following
flow (II is input initializer, VM is Vertex Manager):
1) Vertex starts IIs
2) IIs send Vertex events when they are done
3) Vertex forwards events to VM and changes state to INITED
4) VM forwards events back to the Vertex (potentially after changing some
things). But by that time the Vertex is already INITED.
This race condition already existed but was never a problem until recovery came
into the picture.
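To make the ordering problem concrete, here is a rough single-threaded sketch of the four steps above. Everything in it is hypothetical: the class, the queue standing in for the dispatcher, the string events and the "recovery log" list are illustrations only, not the real Tez types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

// Hypothetical, simplified model of the flow; not the real Tez classes.
class InitRaceSketch {
    enum State { NEW, INITIALIZING, INITED }

    // Stand-in for the async dispatcher: events are queued and handled later.
    final Queue<Runnable> dispatcher = new ArrayDeque<>();
    // Stand-in for the recovery log.
    final List<String> recoveryLog = new ArrayList<>();
    // Events the VM eventually sends back to the vertex.
    final List<String> finalInitEvents = new ArrayList<>();
    State state = State.NEW;

    // Steps 1-2: the vertex starts the IIs, which report back when done.
    void start() {
        state = State.INITIALIZING;
        dispatcher.add(() -> onInitializerDone(Arrays.asList("split-0", "split-1")));
    }

    // Step 3: forward the events to the VM, then move to INITED right away.
    void onInitializerDone(List<String> events) {
        vertexManagerOnRootInitialized(events);
        state = State.INITED;
        recoveryLog.add("INITED with events " + finalInitEvents);
    }

    // Step 4: the VM sends the (possibly modified) events back via the dispatcher.
    void vertexManagerOnRootInitialized(List<String> events) {
        dispatcher.add(() -> finalInitEvents.addAll(events));
    }

    // Process queued events until the queue is empty.
    void drain() {
        Runnable next;
        while ((next = dispatcher.poll()) != null) {
            next.run();
        }
    }
}
```

In this model the INITED record is written while the final event list is still empty, and the VM's events only arrive afterwards. That is the window that matters once recovery reads the log.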
The ideal solution is to remove this race condition. An option for that is
TEZ-703 that aims to move II control to the VM. So instead of the Vertex
starting IIs, the vertex starts the VM and the VM starts the IIs. IIs send their
events back to VM directly. VM sends final events (after modifying them if
needed) to the Vertex via InputInitDone notification. At this point the vertex
knows the final events and can change state to INITED. This greatly simplifies
the Vertex state machine and also removes the race condition. Nothing
materially changes in the IIPlugin or the VMPlugin and so it should be
backwards compatible.
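A minimal sketch of what that flow could look like, under heavy assumptions: the class names, the inputInitDone callback signature, the string events and the "-final" modification are all made up for illustration and are not taken from TEZ-703 or the Tez code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed flow; not the real Tez classes.
class ProposedFlowSketch {
    // Assumed shape of the InputInitDone notification.
    interface InitDoneListener {
        void inputInitDone(List<String> finalEvents);
    }

    static class VertexManager {
        private final InitDoneListener vertex;

        VertexManager(InitDoneListener vertex) {
            this.vertex = vertex;
        }

        // The VM, not the vertex, starts the IIs and receives their events.
        void start() {
            List<String> fromInitializer = runInitializer();
            List<String> finalEvents = new ArrayList<>();
            for (String e : fromInitializer) {
                finalEvents.add(e + "-final"); // assumed modification by the VM
            }
            vertex.inputInitDone(finalEvents);
        }

        // Stand-in for an II producing its events.
        private List<String> runInitializer() {
            return Arrays.asList("split-0", "split-1");
        }
    }

    static class Vertex implements InitDoneListener {
        String state = "NEW";
        List<String> initEvents = new ArrayList<>();

        // The vertex only starts the VM.
        void start() {
            state = "INITIALIZING";
            new VertexManager(this).start();
        }

        // The vertex sees the final events before it changes state.
        @Override
        public void inputInitDone(List<String> finalEvents) {
            initEvents = finalEvents;
            state = "INITED";
        }
    }
}
```

The point of the sketch is only the ordering: the vertex has the final events in hand at the moment it moves to INITED, so there is nothing left to arrive later.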
The other stop-gap solution is to have
vertex.vertexManager.onRootVertexInitialized() return the init events in the
return value. This way the init events can be logged before the transition
completes. In order to do this compatibly, VM.addRootInputEvents() could cache
the events instead of sending them via the dispatcher and return the cached
value in the return of onRootVertexInitialized(). This is similar to the inline
event routing patch except that it does not leak event routing logic outside of
the VertexImpl code.
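Roughly, the stop-gap could look like the sketch below. The method names onRootVertexInitialized and addRootInputEvents come from the description above, but their signatures here, the string events, the pass-through plugin behaviour and the "recovery log" list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; method names follow the comment, signatures are assumed.
class CachingVertexManagerSketch {
    private final List<String> cachedEvents = new ArrayList<>();

    // Called by the plugin; caches instead of sending through the dispatcher.
    void addRootInputEvents(List<String> events) {
        cachedEvents.addAll(events);
    }

    // Runs the plugin callback and returns whatever it added.
    List<String> onRootVertexInitialized(List<String> initializerEvents) {
        cachedEvents.clear();
        // Assumed plugin behaviour: pass the events through unchanged.
        addRootInputEvents(initializerEvents);
        return new ArrayList<>(cachedEvents);
    }
}

// Stand-in for the vertex side of the transition.
class StopGapVertexSketch {
    final CachingVertexManagerSketch vertexManager = new CachingVertexManagerSketch();
    final List<String> recoveryLog = new ArrayList<>();
    String state = "INITIALIZING";

    void onInitializerDone(List<String> events) {
        List<String> finalEvents = vertexManager.onRootVertexInitialized(events);
        // The init events are logged before the transition completes.
        recoveryLog.add("init events " + finalEvents);
        state = "INITED";
    }
}
```

Because the events come back as a return value, the routing stays inside the vertex-side code and the plugin-facing methods keep their names.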
We can evaluate the effort and risk of TEZ-703 and if it's too much we can do
the stop-gap solution in the interim.
> Add checks to guarantee all init events are written to recovery to consider
> vertex initialized
> ----------------------------------------------------------------------------------------------
>
> Key: TEZ-1345
> URL: https://issues.apache.org/jira/browse/TEZ-1345
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Attachments: Tez-1345-2.patch, Tez-1345-3.patch, Tez-1345-4.patch,
> Tez-1345-5.patch, Tez-1345-6.patch, Tez-1345.patch
>
>
> Related to issue discovered in TEZ-1033