[
https://issues.apache.org/jira/browse/TEZ-2581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993233#comment-14993233
]
Bikas Saha commented on TEZ-2581:
---------------------------------
Before we go into that, can you please confirm that a Vertex will be re-run
from scratch with original initializer and vertex manager if the
reconfigureDoneEvent is not recovered for that vertex right? The
VertexInitGeneratedEvent is not relevant because a VM could reconfigureDone
with 0 inputs in the vertex.
>From what I see in the code and from the name vertexReconfigurePlanned - I
>think the understanding/usage is not accurate. E.g. even if
>vertexReconfigurePlanned() API is not invoked we can still be in case 1 (-1 ->
>numTasks). In fact this is the common case. So we should ignore those VM
>semantics completely and try to set a flag that shows whether eventually
>setParallelism (or reconfigureVertex) was called to change the vertex state.
If not called, then NoOpManager has to do nothing except start tasks
onVertexStarted().
If yes, then we have a problem of figuring out when to invoke setParallelism
internally. Here the -1 tasks case becomes a problem because the vertex goes
into initializing state but there is no external trigger that would make our
NoOpVertexManager call setParallelism() to move the vertex into running state.
My caution with calling setParallelism() inside vertexManager is that it is
fragile to the logic inside setParallelism() and also may affect the listeners
of the vertex_parallelism_updated notification. Alternatively, we could just
set the changed data directly in setupVertex/assignVertexManager and have the
NoOpVertexManager invoke an empty() setParallelism/reconfigureVertex during
onVertexStarted(). However that would also be faking the change and any
downstream listeners who depend on numTasks value changing may be fooled. This
is option 1) - variant of the current patch.
Alternatively, if numTasks is -1 and we are setting NoOpVertexManager then we
could send a fake InputInitializer completed to trigger the NoOpVertexManager
to call setParallelism. This would change the VertexImpl code path in a bad
way. This is option 2).
Alternatively, we could put the faking logic into NoOpVertexManager and based
on -1 tasks it could self trigger internally and call
setParallelism/reconfigureVertex with real data. This is option 3). For
external observers and the internal VertexImpl code this keeps their code clean
and as close as possible to the normal flow.
Alternatively, if numTasks is -1 then the NoOpVertexManager could start
listening for a VertexState=Initializing notification from its own vertex.
Then, after the state transition completes and the vertex moves into
initializing state, it will sent a notification about it, which will be
received by the NoOpVertexManager and it will call
setParallelism/reconfigureVertex with real data. This is option 4. This also
keeps the code clean in VertexImpl and external user code. Downside is that
currently VertexState.initialing is currently not a state notification but that
can be added. (org.apache.tez.dag.api.event.VertexState).
Orthogonal to all this is whether NoOpVertexManager invokes
reconfigurationPlanned or not. If it does not call reconfigureVertex then it
should not. But if it calls reconfigureVertex, then calling
reconfigurePlanned() during initialize() and reconfigurationDone() at the end
is the correct API flow. reconfigurePlanned() was made optional only for
backwards compatibility. Calling reconfigurePlanned() will make the code work
regardless of whether initially vertex moves into initializing state (dagPlan
numTasks < 0) or running state (dagPlan numTasks >= 0).
> Umbrella for Tez Recovery Redesign
> ----------------------------------
>
> Key: TEZ-2581
> URL: https://issues.apache.org/jira/browse/TEZ-2581
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2581-WIP-1.patch, TEZ-2581-WIP-2.patch,
> TEZ-2581-WIP-3.patch, TEZ-2581-WIP-4.patch, TEZ-2581-WIP-5.patch,
> TEZ-2581-WIP-6.patch, TEZ-2581-WIP-7.patch, TEZ-2581-WIP-8.patch,
> TEZ-2581-WIP-9.patch, TezRecoveryRedesignProposal.pdf,
> TezRecoveryRedesignV1.1.pdf
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)