[jira] [Commented] (TEZ-2581) Umbrella for Tez Recovery Redesign

Bikas Saha (JIRA) Thu, 05 Nov 2015 11:57:49 -0800

    [ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992358#comment-14992358
 ]


Bikas Saha commented on TEZ-2581:
---------------------------------

I may not have been clear in explaining the vertex manager flow. If the vertex 
is not fully configured (e.g. numTasks = -1) then the vertex is always 
configured by the VM plugin, and that eventually ends up calling the 
setParallelism method. In this case the vertex goes into the initializing state 
and moves to running after setParallelism. It is optional for the VM to call 
vertexReconfigurationPlanned() in this case, for backwards compatibility. The 
VM is required to call reconfigurationPlanned (which sets 
vertexToBeConfiguredbyVM) for the case when the vertex has numTasks > 0 and the 
VM still wants to change that later on. In this case, the vertex can move into 
the running state and then the VM can change parallelism. Then, the VM must 
call doneReconfiguringVertex(), which informs the vertex that the VM is done 
and that it can send out the configured notification.
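
As a rough illustration of the two cases above, here is a minimal Java sketch 
of the handshake. VertexFlowSketch, its State enum and its fields are 
hypothetical names made up for this example; it is not the actual VertexImpl 
code, and it assumes the configured notification is held back whenever a 
reconfiguration has been announced.

```java
// Hypothetical model of the vertex/VM handshake; NOT the real Tez VertexImpl.
class VertexFlowSketch {
    enum State { NEW, INITIALIZING, RUNNING }

    State state = State.NEW;
    int numTasks;
    boolean reconfigurationPlanned = false;
    boolean configuredNotificationSent = false;

    VertexFlowSketch(int numTasks) {
        this.numTasks = numTasks;
    }

    // The VM announces that it will change the vertex later on.
    void vertexReconfigurationPlanned() {
        reconfigurationPlanned = true;
    }

    // An undefined task count (-1) parks the vertex in INITIALIZING;
    // otherwise it can go straight to RUNNING.
    void initialize() {
        if (numTasks < 0) {
            state = State.INITIALIZING;
        } else {
            state = State.RUNNING;
            maybeSendConfigured();
        }
    }

    void setParallelism(int newNumTasks) {
        if (state == State.NEW) {
            throw new IllegalStateException("vertex not initialized yet");
        }
        if (state == State.RUNNING && !reconfigurationPlanned) {
            // Assumption: a running vertex may only change if the VM said so up front.
            throw new IllegalStateException("reconfiguration was not planned");
        }
        numTasks = newNumTasks;
        if (state == State.INITIALIZING) {
            state = State.RUNNING;
            maybeSendConfigured();
        }
    }

    // The VM is done; the vertex may now send the configured notification.
    void doneReconfiguringVertex() {
        if (!reconfigurationPlanned) {
            throw new IllegalStateException("reconfiguration was not planned");
        }
        reconfigurationPlanned = false;
        configuredNotificationSent = true;
    }

    // Without a planned reconfiguration the vertex is configured as soon as it runs.
    private void maybeSendConfigured() {
        if (!reconfigurationPlanned) {
            configuredNotificationSent = true;
        }
    }
}
```

In this model a vertex created with -1 tasks reaches RUNNING only through 
setParallelism, while a vertex created with a positive task count reaches 
RUNNING at once and stays unconfigured until doneReconfiguringVertex() if the 
VM planned a change.
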
So if setParallelism has not been invoked, then there is no need to save any 
information in the vertexReconfigurationDoneEvent. The empty event is logged to 
indicate that the vertex is fully defined (in this case it's the same as the 
definition in the dagPlan). If setParallelism is invoked, then its changes 
(tasks/location/spec etc.) should be stored in the reconfigureDoneEvent.
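
A minimal sketch of that logging rule, assuming a simplified event that only 
tracks the task count (the locations and specs mentioned above are left out). 
ReconfigureDoneEventSketch and its methods are hypothetical names for 
illustration, not the real Tez event class.

```java
import java.util.OptionalInt;

// Hypothetical, simplified stand-in for the logged event; NOT the real Tez class.
final class ReconfigureDoneEventSketch {
    // Empty means setParallelism was never invoked in the first AM.
    private final OptionalInt numTasks;

    private ReconfigureDoneEventSketch(OptionalInt numTasks) {
        this.numTasks = numTasks;
    }

    // Logged when the VM changed nothing: the dagPlan definition still holds.
    static ReconfigureDoneEventSketch empty() {
        return new ReconfigureDoneEventSketch(OptionalInt.empty());
    }

    // Logged when setParallelism ran: the change must be stored for recovery.
    static ReconfigureDoneEventSketch afterSetParallelism(int numTasks) {
        return new ReconfigureDoneEventSketch(OptionalInt.of(numTasks));
    }

    boolean isEmpty() {
        return !numTasks.isPresent();
    }

    // On recovery, fall back to the dagPlan value when nothing was stored.
    int effectiveNumTasks(int dagPlanNumTasks) {
        return numTasks.orElse(dagPlanNumTasks);
    }
}
```

The point of the empty event is that its presence alone records that the 
vertex is fully defined, so recovery can tell "nothing changed" apart from 
"never reached this point".
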
Upon recovery, if reconfigureDoneEvent is empty, the NoOpVertexManager has an 
empty payload and should not invoke reconfigureVertex(). This is the case when 
the VM did nothing in the first AM (e.g. ImmediateStartVertexManager). If 
setParallelism was invoked, then the next question is whether the vertex moved 
from -1 to numTasks or from numTasks1 to numTasks2. As you point out, this affects 
when the NoOpVM can invoke reconfigureVertex, because in the first case the 
vertex will be in the initializing state and we have no way to trigger 
NoOpVertexManager to call reconfigureVertex other than calling VM.initialize. 
Solutions? The cleanest solution I can think of is for NoOpVertexManager to 
always call reconfigurationPlanned() in VM.initialize(), and then check 
vertex.numTasks in VM.initialize(). If numTasks < 0 then it has to fake a 
trigger by setting up a timer. Upon the trigger it will call 
reconfigureVertex() and reconfigureDone(). If numTasks >= 0 then it does 
nothing in initialize(): the vertex will move to the running state and call 
VM.start(), in which the VM can invoke reconfigureVertex() and 
reconfigureDone(). In this way, I think the main VertexImpl flow will stay 
close to the normal flow. Thoughts?
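
The proposal above could look roughly like the following Java sketch. 
Everything here is hypothetical: VmContextSketch, RecordingContext and 
NoOpVertexManagerSketch only mimic the method names used in this comment and 
are not the real Tez plugin API. The sketch also assumes that a manager with 
an empty payload makes no calls at all, and it models the timer as a pluggable 
trigger so the ordering is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for the plugin context; NOT the real Tez interface.
interface VmContextSketch {
    int getVertexNumTasks();
    void vertexReconfigurationPlanned();
    void reconfigureVertex(int numTasks);
    void doneReconfiguringVertex();
}

// Records calls in order so the flow can be checked.
class RecordingContext implements VmContextSketch {
    final List<String> calls = new ArrayList<>();
    private int numTasks;

    RecordingContext(int numTasks) { this.numTasks = numTasks; }

    public synchronized int getVertexNumTasks() { return numTasks; }
    public synchronized void vertexReconfigurationPlanned() { calls.add("planned"); }
    public synchronized void reconfigureVertex(int n) { numTasks = n; calls.add("reconfigure:" + n); }
    public synchronized void doneReconfiguringVertex() { calls.add("done"); }
}

class NoOpVertexManagerSketch {
    private final VmContextSketch ctx;
    private final Integer recoveredNumTasks;   // null models an empty reconfigureDoneEvent
    private final Consumer<Runnable> trigger;  // hypothetical hook standing in for the timer
    private boolean replayed = false;

    NoOpVertexManagerSketch(VmContextSketch ctx, Integer recoveredNumTasks,
                            Consumer<Runnable> trigger) {
        this.ctx = ctx;
        this.recoveredNumTasks = recoveredNumTasks;
        this.trigger = trigger;
    }

    void initialize() {
        if (recoveredNumTasks == null) {
            return; // Assumption: nothing was changed in the first AM, so stay silent.
        }
        ctx.vertexReconfigurationPlanned();
        if (ctx.getVertexNumTasks() < 0) {
            // The vertex is stuck in initializing, so fake a trigger.
            trigger.accept(this::replay);
        }
        // Otherwise wait for the vertex to start.
    }

    void onVertexStarted() {
        if (recoveredNumTasks != null) {
            replay();
        }
    }

    // Replays the recovered change exactly once.
    private synchronized void replay() {
        if (replayed) {
            return;
        }
        replayed = true;
        ctx.reconfigureVertex(recoveredNumTasks);
        ctx.doneReconfiguringVertex();
    }

    // One possible timer-based trigger: run the task once after a delay.
    static Consumer<Runnable> timerTrigger(long delayMillis) {
        return task -> {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.schedule(task, delayMillis, TimeUnit.MILLISECONDS);
            timer.shutdown(); // the already scheduled task still runs by default
        };
    }
}
```

Passing Runnable::run as the trigger runs the replay synchronously, which is 
handy for checking the call order; NoOpVertexManagerSketch.timerTrigger(100) 
would defer it by roughly 100 ms instead. The replayed flag keeps the later 
start callback from applying the change a second time in the -1 case.
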
Let's converge on this before changing this code again. It may be quicker to 
converge in comments than in review iterations. This is complicated. Thanks for 
your patience through all these comments :)

> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, 
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
