[ 
https://issues.apache.org/jira/browse/TEZ-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194989#comment-14194989
 ] 

Siddharth Seth commented on TEZ-1547:
-------------------------------------

bq. If anything nudged you towards that then let me know and I can add 
clarifying comments.
I guess what gives this impression is the event going out after the vertex has 
started. More on this later. Also, the fact that there's two issues that were 
fixed via TEZ-1494, TEZ-1597, TEZ-1749 - one being the hang, and the second 
being out of order scheduling which would cause tasks belonging to consumers 
eating up the cluster before producers start - thus causing unnecessary 
slowdowns. The need to remove startAfterOneTaskComplete.

Thanks for the clarification. Will change the title to reflect the same - it's 
fairly vague at the moment.

bq.  As you can see, the Shuffle VM first notifies of configuration complete, 
and then proceeds to schedule tasks based on the existing min/max logic.
Wouldn't this lead to a situation where the notified vertex - ends up 
scheduling it's tasks before the ShuffleVertex can schedule any tasks ? Even 
though at a different priority - any task on the downstream vertex which starts 
up will end up taking up cluster capacity unnecessarily. That effectively 
undoes the changes from 1494, etc for ShuffleEdges.

On the event going out after the Vertex has started.
There's 3 cases
1) A pre-configured vertex which intends to re-configure. This has to go 
through the vertexReconfigurationPlanned API. The CONFIGURED event can go out 
the moment doneReconfiguringVertex is invoked, without the requirement for the 
start check.
2) An unconfigured vertex, making use of the vertexReconfigurationPlanned API. 
The same applies - send event as soon as doneReconfiguringVertex is invoked 
(without the startTime check - looks like this check went out in the .9 patch)
3) An unconfigured vertex, not making use of the vertexReconfigurationPlanned 
API. This won't reach the StartVertex transition until it;s configured 
(parallelism != -1, null edges etc). Even if the RECONFIGURED event is sent out 
in startVertex() - it's safe to send it out before invoking onVertexStarted.

For 1, we should also look at when vertexReconfigurationPlanned can be invoked. 
Constructor / initialize() seems to be the only place where this method should 
be called, and should be disallowed after this.

Now if we move the maybeSendConfiguredEvent() before onVertexStarted - which is 
correct and safe for the intent of the event - it's very likely that downstream 
vertices will end up scheduling tasks before the srcVertex. Take the case of 
ImmediateStartVertexManager. Sending the event after onVertexStarted happens to 
be giving us the behaviour that we want, in terms of correct scheduling order.

So, essentially, I'm trying to understand why the event goes out after 
onVertexStarted, even though we can send it out much earlier (even from within 
startVertex). If it's for the correct scheduling behaviour right now - lets 
just get the patch in and fix the scheduling later.

If we're going with this - I think the startTime check in 
doneReconfiguringVertex is important, also ShuffleVertexManager should be 
scheduling tasks before indicating that it's doneReconfiguring (primarily to 
avoid regressing on the scheduling order). Also documenting when 
vertexReconfigurationPlanned can be invoked.



> Make use of state change notifier in VertexManagerPlugins
> ---------------------------------------------------------
>
>                 Key: TEZ-1547
>                 URL: https://issues.apache.org/jira/browse/TEZ-1547
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Bikas Saha
>         Attachments: TEZ-1547.1.patch, TEZ-1547.10.patch, TEZ-1547.3.patch, 
> TEZ-1547.4.patch, TEZ-1547.5.patch, TEZ-1547.6.patch, TEZ-1547.7.patch, 
> TEZ-1547.8.patch, TEZ-1547.9.patch
>
>
> Instead of the various APIs like onVertexStarted, simple notifications could 
> be sent.
> Some existing APIs could end up being deprecated.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to