[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992358#comment-14992358
]
Bikas Saha commented on TEZ-2581:
---------------------------------
I may not have been clear in explaining the vertex manager flow. If the vertex
is not fully configured (e.g. numTasks = -1), then the vertex is always
configured by the VM plugin, which eventually ends up calling the
setParallelism method. In this case the vertex goes into the initializing state
and moves to running after setParallelism. It is optional for the VM to call
vertexReconfigurationPlanned() in this case, for backwards compatibility. The
VM is required to call vertexReconfigurationPlanned() (which sets
vertexToBeConfiguredbyVM) when the vertex has numTasks > 0 and the VM still
wants to change that later on. In this case, the vertex can move into the
running state and then the VM can change parallelism. Then the VM must call
doneReconfiguringVertex(), which informs the vertex that the VM is done and
that it can send out the configured notification.
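To make the two cases above concrete, here is a toy state model. It is only a sketch: ToyVertex, tryStart() and the boolean flags are hypothetical stand-ins and do not mirror the real VertexImpl, and tying the configured notification to the moment the vertex starts is a simplifying assumption.

```java
/** Toy model of the flow described above. It is not Tez's VertexImpl. */
final class VertexFlowSketch {
    enum State { INITIALIZING, RUNNING }

    static final class ToyVertex {
        int numTasks;                     // -1 means "not yet known"
        State state = State.INITIALIZING;
        boolean reconfigurationPlanned;   // stands in for vertexToBeConfiguredbyVM
        boolean configuredNotificationSent;

        ToyVertex(int numTasks) { this.numTasks = numTasks; }

        /** VM announces that it intends to change the vertex later. */
        void reconfigurationPlanned() { reconfigurationPlanned = true; }

        /** Initialization finished: run only if the task count is known. */
        void tryStart() {
            if (state == State.RUNNING || numTasks < 0) {
                return; // already running, or still waiting for setParallelism
            }
            state = State.RUNNING;
            if (!reconfigurationPlanned) {
                // Assumption: with no planned changes the vertex is final here.
                configuredNotificationSent = true;
            }
        }

        /** VM sets the task count; an undefined vertex may now start. */
        void setParallelism(int newNumTasks) {
            numTasks = newNumTasks;
            tryStart();
        }

        /** VM is done; the vertex may announce its final configuration. */
        void doneReconfiguring() {
            reconfigurationPlanned = false;
            configuredNotificationSent = true;
        }
    }

    public static void main(String[] args) {
        ToyVertex undefined = new ToyVertex(-1);
        undefined.tryStart();
        System.out.println(undefined.state); // still waiting for the VM
        undefined.setParallelism(4);
        System.out.println(undefined.state);
    }
}
```

In the numTasks > 0 case the same model lets the vertex reach RUNNING first, and the notification is held back until doneReconfiguring() is called.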
So if setParallelism has not been invoked, then there is no need to save any
information in the vertexReconfigurationDoneEvent. The empty event is logged to
indicate that the vertex is fully defined (in this case it's the same as the
definition in the dagPlan). If setParallelism is invoked, then its changes
(tasks/location/spec etc.) should be stored in the reconfigureDoneEvent.
Upon recovery, if reconfigureDoneEvent is empty, the NoOpVertexManager has an
empty payload and should not invoke reconfigureVertex(). This is the case when
the VM did nothing in the first AM (e.g. ImmediateStartVertexManager). If
setParallelism was invoked, then the next question is whether the vertex moved
from -1 to numTasks or from numTasks1 to numTasks2. As you point out, this affects
when the NoOpVM can invoke reconfigureVertex because in the first case, the
vertex will be in initializing state and we have no way to trigger
NoOpVertexManager to call reconfigureVertex and calling VM.initialize.
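The recovery cases above boil down to a three-way decision, sketched below. The names (RecoveryDecisionSketch, Action, decide) are made up for illustration, and an empty OptionalInt stands in for an empty reconfigureDoneEvent; the real event would also carry locations, specs and so on.

```java
import java.util.OptionalInt;

/** Sketch of the recovery-time decision; all names here are hypothetical. */
final class RecoveryDecisionSketch {
    enum Action {
        SKIP_RECONFIGURE,                // empty event: VM did nothing in the first AM
        RECONFIGURE_WHILE_INITIALIZING,  // vertex went from -1 to numTasks
        RECONFIGURE_AFTER_START          // vertex went from numTasks1 to numTasks2
    }

    /**
     * @param dagPlanNumTasks   task count from the dagPlan, -1 if undefined
     * @param recoveredNumTasks task count stored in the recovered event,
     *                          empty if setParallelism was never invoked
     */
    static Action decide(int dagPlanNumTasks, OptionalInt recoveredNumTasks) {
        if (!recoveredNumTasks.isPresent()) {
            return Action.SKIP_RECONFIGURE;
        }
        // With an undefined plan the vertex is stuck in the initializing
        // state, so the replay cannot wait for the vertex to start.
        return dagPlanNumTasks < 0
                ? Action.RECONFIGURE_WHILE_INITIALIZING
                : Action.RECONFIGURE_AFTER_START;
    }

    public static void main(String[] args) {
        System.out.println(decide(-1, OptionalInt.of(8)));
    }
}
```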
Solutions? The cleanest solution I can think of is for NoOpVertexManager to
always call reconfigurationPlanned() in VM.initialize(), and then check
vertex.numTasks, also in VM.initialize(). If numTasks < 0, then it has to fake
a trigger by setting up a timer. Upon the trigger it will call
reconfigureVertex() and reconfigureDone(). If numTasks >= 0, then it does
nothing more in VM.initialize(). The vertex will move to the running state and
call VM.start(), in which the VM can invoke reconfigureVertex() and
reconfigureDone(). In this way, I think the main VertexImpl flow will stay
close to the normal flow. Thoughts?
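A rough sketch of the proposal above, to check that the call order works out. Context is a hypothetical stand-in for whatever the VM uses to talk to the vertex, the timer is injected as a Consumer<Runnable> so the trigger can be fired by hand, and the `replayed` guard is my own addition so that VM.start() does not replay a second time after the timer path. The empty-event case (where nothing should be replayed) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of the proposed NoOpVertexManager flow; not the real Tez classes. */
final class NoOpManagerSketch {
    /** Hypothetical stand-in for the calls the VM makes on the vertex. */
    interface Context {
        int numTasks();
        void reconfigurationPlanned();
        void reconfigureVertex();
        void reconfigureDone();
    }

    static final class NoOpManager {
        private final Context ctx;
        private final Consumer<Runnable> timer; // schedules the fake trigger
        private boolean replayed;               // assumption: replay at most once

        NoOpManager(Context ctx, Consumer<Runnable> timer) {
            this.ctx = ctx;
            this.timer = timer;
        }

        /** VM.initialize(): always plan, and fake a trigger if numTasks < 0. */
        void initialize() {
            ctx.reconfigurationPlanned();
            if (ctx.numTasks() < 0) {
                timer.accept(this::replay);
            }
            // numTasks >= 0: do nothing more; wait for start().
        }

        /** VM.start(): called once the vertex has moved to running. */
        void start() {
            replay();
        }

        private void replay() {
            if (replayed) {
                return;
            }
            replayed = true;
            ctx.reconfigureVertex();
            ctx.reconfigureDone();
        }
    }

    /** Test double that records the order of calls. */
    static final class RecordingContext implements Context {
        final List<String> calls = new ArrayList<>();
        private final int numTasks;

        RecordingContext(int numTasks) { this.numTasks = numTasks; }

        public int numTasks() { return numTasks; }
        public void reconfigurationPlanned() { calls.add("reconfigurationPlanned"); }
        public void reconfigureVertex() { calls.add("reconfigureVertex"); }
        public void reconfigureDone() { calls.add("reconfigureDone"); }
    }
}
```

In a real manager the Consumer<Runnable> could be backed by something like a java.util.Timer or a ScheduledExecutorService; it is injected here only to keep the sketch deterministic.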
Let's converge on this before changing this code again. It may be quicker to
converge in comments than in review iterations. This is complicated. Thanks for
your patience through all these comments :)
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch,
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch,
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch,
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>