[
https://issues.apache.org/jira/browse/TEZ-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17078237#comment-17078237
]
Syed Shameerur Rahman commented on TEZ-4140:
--------------------------------------------
[~jeagles] [~bikassaha] [~jlowe] [~zjffdu] Can you please review?
> TEZ Recovery: Discrepancy In Scheduling Vertices During Vertex Recovery
> -----------------------------------------------------------------------
>
> Key: TEZ-4140
> URL: https://issues.apache.org/jira/browse/TEZ-4140
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2, 0.9.0, 0.8.4, 0.9.1, 0.9.2
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Fix For: 0.10.0, 0.9.3
>
> Attachments: DAG.png, TEZ-4140.01.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> *Issue*:
> During vertex recovery, the initialization stage of a vertex is skipped if
> 1) VertexInputInitializerEvent
> 2) VertexReconfigureDoneEvent
> are seen in the recovery data. The initialization stage is skipped by
> replacing the vertex's VertexManagerPlugin (e.g. ShuffleVertexManager,
> CustomVertexManager, etc.) with NoOpVertexManager. There are a couple of
> issues with replacing the VertexManagerPlugin with NoOpVertexManager:
> 1) A VertexManagerPlugin's work is complete only after the tasks of that
> vertex have been launched. So using NoOpVertexManager without checking
> whether tasks for that particular vertex were launched in the previous run
> can lead to a discrepancy in deciding when and how many tasks should be
> launched in that vertex during recovery.
> 2) Maintaining vertex dependency:
> Say we have two vertices v1 and v2, where v2 depends on v1 (v1 ---> v2).
> If for some reason v1 was not able to skip the initialization stage but v2
> was, there is a chance that v2 gets scheduled before v1, since
> NoOpVertexManager is used.
> This is the problem I have faced. A DAG is attached for reference:
> !DAG.png!
> In the DAG, Reducer 7 depends on Reducer 6. For some reason, during Tez
> recovery, Reducer 6's initialization stage was not skipped whereas
> Reducer 7's was, so NoOpVertexManager was used for Reducer 7 instead of
> ShuffleVertexManager and went on to launch all of Reducer 7's tasks without
> waiting for Reducer 6's completion. Initially it was decided that Reducer 6
> would launch 14 tasks, and based on that information the tasks launched in
> Reducer 7 were waiting for 14 shuffle inputs. Later, due to AutoReduce
> parallelism, the number of tasks in Reducer 6 was adjusted to 1. Reducer 7's
> tasks did not know about this and kept waiting for 14 shuffle inputs when
> there was actually only 1, hence the query was stuck. This can also lead to
> a deadlock when the number of containers is limited and Reducer 7 ends up
> using all of them.
> *Proposed Solution:*
> In addition to the conditions on VertexInputInitializerEvent and
> VertexReconfigureDoneEvent, introduce a couple more conditions:
> 1) Check whether tasks were launched in the vertex in the previous run before
> replacing the VertexManagerPlugin with NoOpVertexManager.
> 2) All the parent vertices should have skipped the initialization stage
> before the child vertex does. This is required to maintain vertex dependency.
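
For reviewers, the combined skip-initialization check described above could be sketched roughly as follows. This is only an illustration and is not taken from TEZ-4140.01.patch: the class RecoverySkipSketch, the VertexInfo holder and all of its field names are hypothetical, and the sketch assumes that both recovery events must be present for the existing condition to hold.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Tez code base.
final class RecoverySkipSketch {

    /** Hypothetical summary of what the recovery data says about one vertex. */
    static final class VertexInfo {
        final boolean sawInputInitializerEvent;   // VertexInputInitializerEvent seen
        final boolean sawReconfigureDoneEvent;    // VertexReconfigureDoneEvent seen
        final boolean tasksLaunchedInPreviousRun; // proposed condition 1
        final List<String> parents;               // names of upstream vertices

        VertexInfo(boolean sawInputInitializerEvent, boolean sawReconfigureDoneEvent,
                   boolean tasksLaunchedInPreviousRun, List<String> parents) {
            this.sawInputInitializerEvent = sawInputInitializerEvent;
            this.sawReconfigureDoneEvent = sawReconfigureDoneEvent;
            this.tasksLaunchedInPreviousRun = tasksLaunchedInPreviousRun;
            this.parents = parents;
        }
    }

    /** Returns true only if the vertex and all of its ancestors may skip initialization. */
    static boolean canSkipInit(String vertex, Map<String, VertexInfo> dag) {
        VertexInfo v = dag.get(vertex);
        // Existing condition (assumed here to require both events).
        if (!(v.sawInputInitializerEvent && v.sawReconfigureDoneEvent)) {
            return false;
        }
        // Proposed condition 1: tasks must have been launched in the previous run.
        if (!v.tasksLaunchedInPreviousRun) {
            return false;
        }
        // Proposed condition 2: every parent must also be able to skip initialization.
        // The DAG is acyclic, so this recursion terminates.
        for (String parent : v.parents) {
            if (!canSkipInit(parent, dag)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, VertexInfo> dag = new HashMap<>();
        // v1 ---> v2, where v1 has no usable recovery data but v2 does.
        dag.put("v1", new VertexInfo(false, false, false, Collections.emptyList()));
        dag.put("v2", new VertexInfo(true, true, true, Collections.singletonList("v1")));
        System.out.println("v1 can skip init: " + canSkipInit("v1", dag));
        System.out.println("v2 can skip init: " + canSkipInit("v2", dag));
    }
}
```

In the v1 ---> v2 case from the description, v2 is not allowed to skip initialization while v1 cannot, so under this sketch v2 would keep its original VertexManagerPlugin instead of being switched to NoOpVertexManager.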
--
This message was sent by Atlassian Jira
(v8.3.4#803005)