[ 
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993233#comment-14993233
 ] 

Bikas Saha commented on TEZ-2581:
---------------------------------

Before we go into that, can you please confirm that a Vertex will be re-run 
from scratch with original initializer and vertex manager if the 
reconfigureDoneEvent is not recovered for that vertex right? The 
VertexInitGeneratedEvent is not relevant because a VM could reconfigureDone 
with 0 inputs in the vertex.

>From what I see in the code and from the name vertexReconfigurePlanned - I 
>think the understanding/usage is not accurate. E.g. even if 
>vertexReconfigurePlanned() API is not invoked we can still be in case 1 (-1 -> 
>numTasks). In fact this is the common case. So we should ignore those VM 
>semantics completely and try to set a flag that shows whether eventually 
>setParallelism (or reconfigureVertex) was called to change the vertex state. 

If not called, then NoOpManager has to do nothing except start tasks 
onVertexStarted(). 

If yes, then we have a problem of figuring out when to invoke setParallelism 
internally. Here the -1 tasks case becomes a problem because the vertex goes 
into initializing state but there is no external trigger that would make our 
NoOpVertexManager call setParallelism() to move the vertex into running state. 

My caution with calling setParallelism() inside vertexManager is that it is 
fragile to the logic inside setParallelism() and also may affect the listeners 
of the vertex_parallelism_updated notification. Alternatively, we could just 
set the changed data directly in setupVertex/assignVertexManager and have the 
NoOpVertexManager invoke an empty() setParallelism/reconfigureVertex during 
onVertexStarted(). However that would also be faking the change and any 
downstream listeners who depend on numTasks value changing may be fooled. This 
is option 1) - variant of the current patch.

Alternatively, if numTasks is -1 and we are setting NoOpVertexManager then we 
could send a fake InputInitializer completed to trigger the NoOpVertexManager 
to call setParallelism. This would change the VertexImpl code path in a bad 
way. This is option 2).

Alternatively, we could put the faking logic into  NoOpVertexManager and based 
on -1 tasks it could self trigger internally and call 
setParallelism/reconfigureVertex with real data. This is option 3). For 
external observers and the internal VertexImpl code this keeps their code clean 
and as close as possible to the normal flow.

Alternatively, if numTasks is -1 then the NoOpVertexManager could start 
listening for a VertexState=Initializing notification from its own vertex. 
Then, after the state transition completes and the vertex moves into 
initializing state, it will sent a notification about it, which will be 
received by the NoOpVertexManager and it will call 
setParallelism/reconfigureVertex with real data. This is option 4. This also 
keeps the code clean in VertexImpl and external user code. Downside is that 
currently VertexState.initialing is currently not a state notification but that 
can be added. (org.apache.tez.dag.api.event.VertexState).

Orthogonal to all this is whether NoOpVertexManager invokes 
reconfigurationPlanned or not. If it does not call reconfigureVertex then it 
should not. But if it calls reconfigureVertex, then calling 
reconfigurePlanned() during initialize() and reconfigurationDone() at the end 
is the correct API flow. reconfigurePlanned() was made optional only for 
backwards compatibility. Calling reconfigurePlanned() will make the code work 
regardless of whether initially vertex moves into initializing state (dagPlan 
numTasks < 0) or running state (dagPlan numTasks >= 0).

> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
>                 Key: TEZ-2581
>                 URL: https://issues.apache.org/jira/browse/TEZ-2581
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch, 
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch, 
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch, 
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf, 
> TezRecoveryRedesignV1.1.pdf
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to