[
https://issues.apache.org/jira/browse/TEZ-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077237#comment-17077237
]
TezQA commented on TEZ-4140:
----------------------------
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 9m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 29s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 11s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 4m 0s{color} | {color:green} tez-dag in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 8s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 59s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/336/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4140 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12999236/TEZ-4140.01.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 4ed1da67933c 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / b5b432b |
| Default Java | 1.8.0_242 |
| Test Results | https://builds.apache.org/job/PreCommit-TEZ-Build/336/testReport/ |
| Max. process+thread count | 218 (vs. ulimit of 5500) |
| modules | C: tez-dag U: tez-dag |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/336/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |
This message was automatically generated.
> TEZ Recovery: Discrepancy In Scheduling Vertices During Vertex Recovery
> -----------------------------------------------------------------------
>
> Key: TEZ-4140
> URL: https://issues.apache.org/jira/browse/TEZ-4140
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.8.2, 0.9.0, 0.8.4, 0.9.1, 0.9.2
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Fix For: 0.10.0, 0.9.3
>
> Attachments: DAG.png, TEZ-4140.01.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> *Issue*:
> During vertex recovery, the initialization stage of a vertex is skipped if
> the following events are seen in the recovery data:
> 1) VertexInputInitializerEvent
> 2) VertexReconfigureDoneEvent
> The initialization stage is skipped by replacing any VertexManagerPlugin
> (e.g. ShuffleVertexManager, CustomVertexManager) with NoOpVertexManager.
> There are a couple of issues with replacing the VertexManagerPlugin with
> NoOpVertexManager:
> 1) A VertexManagerPlugin's work is complete only after the tasks in its
> vertex have been launched. Using NoOpVertexManager without checking whether
> tasks for that vertex were launched in the previous run can therefore cause
> discrepancies in deciding when, and how many, tasks should be launched in
> that vertex during recovery.
> 2) Maintaining vertex dependency:
> Say we have two vertices v1 and v2, where v2 depends on v1 (v1 ---> v2). If
> for some reason v1 was not able to skip the initialization stage but v2 was,
> there is a chance that v2 gets scheduled before v1, since NoOpVertexManager
> is used.
> The above-mentioned problem is what I have faced. A DAG is attached for
> reference:
> In the DAG, Reducer 7 depends on Reducer 6. For some reason, during Tez
> recovery, Reducer 6's initialization stage was not skipped whereas Reducer
> 7's was, so NoOpVertexManager was used instead of ShuffleVertexManager and
> went on to launch all the tasks in Reducer 7 without waiting for Reducer 6's
> completion. Initially it was decided that Reducer 6 would launch 14 tasks,
> so the tasks launched in Reducer 7 waited for 14 shuffle inputs. Later, due
> to auto-reduce parallelism, the number of tasks in Reducer 6 was adjusted to
> 1. Reducer 7's tasks did not learn about this and kept waiting for 14
> shuffle inputs when there was actually only 1; hence the query was stuck.
> This can also lead to a deadlock when the number of containers is limited
> and Reducer 7 ends up using all of them.
> *Proposed Solution:*
> In addition to the conditions on VertexInputInitializerEvent and
> VertexReconfigureDoneEvent, introduce a couple more conditions:
> 1) Check whether tasks were launched in the vertex in the previous run
> before replacing the VertexManagerPlugin with NoOpVertexManager.
> 2) All the parent vertices should have skipped the initialization stage
> before the child vertex does. This is required to maintain vertex
> dependency.
> CC [~jeagles] [~bikassaha] [~jlowe] [~zjffdu]
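
The two proposed conditions can be read as one extended "can this vertex skip initialization?" check. Below is a minimal sketch of that reading, not the patch itself: RecoverySkipCheck, RecoveredVertex and canSkipInitialization are hypothetical names invented for illustration and are not Tez classes or methods. It also assumes that the existing condition requires both events to be present, and the flag values used for Reducer 6 and Reducer 7 in main are made up to mirror the scenario described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the proposed check.
// None of these names are Tez classes or methods.
final class RecoverySkipCheck {

    /** What the recovery data is assumed to record for one vertex. */
    static final class RecoveredVertex {
        final String name;
        final boolean sawInputInitializerEvent;  // VertexInputInitializerEvent found
        final boolean sawReconfigureDoneEvent;   // VertexReconfigureDoneEvent found
        final boolean tasksLaunchedPreviously;   // tasks were launched in the previous run
        final List<RecoveredVertex> parents = new ArrayList<>();

        RecoveredVertex(String name, boolean sawInputInitializerEvent,
                        boolean sawReconfigureDoneEvent, boolean tasksLaunchedPreviously) {
            this.name = name;
            this.sawInputInitializerEvent = sawInputInitializerEvent;
            this.sawReconfigureDoneEvent = sawReconfigureDoneEvent;
            this.tasksLaunchedPreviously = tasksLaunchedPreviously;
        }
    }

    /** Returns true only if the vertex and all of its ancestors may skip initialization. */
    static boolean canSkipInitialization(RecoveredVertex v) {
        // Existing condition (assumed here to require both events).
        if (!(v.sawInputInitializerEvent && v.sawReconfigureDoneEvent)) {
            return false;
        }
        // Proposed condition 1: tasks must have been launched in the previous run.
        if (!v.tasksLaunchedPreviously) {
            return false;
        }
        // Proposed condition 2: every parent must also be able to skip initialization.
        // The recursion terminates because a DAG has no cycles.
        for (RecoveredVertex parent : v.parents) {
            if (!canSkipInitialization(parent)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Illustrative values only: Reducer 6 lacks the reconfigure-done event.
        RecoveredVertex reducer6 = new RecoveredVertex("Reducer 6", true, false, false);
        RecoveredVertex reducer7 = new RecoveredVertex("Reducer 7", true, true, true);
        reducer7.parents.add(reducer6);

        System.out.println(reducer6.name + " skips init: " + canSkipInitialization(reducer6));
        System.out.println(reducer7.name + " skips init: " + canSkipInitialization(reducer7));
    }
}
```

Under this sketch, Reducer 7 keeps its original VertexManagerPlugin whenever Reducer 6 has to go through initialization, so it cannot be scheduled ahead of its parent.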