[ 
https://issues.apache.org/jira/browse/TEZ-4140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077237#comment-17077237
 ] 

TezQA commented on TEZ-4140:
----------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  9m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 28s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 36s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 29s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m 11s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  8s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  8s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  4m  0s{color} | {color:green} tez-dag in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m  8s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 59s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.8 Server=19.03.8 base: https://builds.apache.org/job/PreCommit-TEZ-Build/336/artifact/out/Dockerfile |
| JIRA Issue | TEZ-4140 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12999236/TEZ-4140.01.patch |
| Optional Tests | dupname asflicense javac javadoc unit spotbugs findbugs checkstyle compile |
| uname | Linux 4ed1da67933c 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | personality/tez.sh |
| git revision | master / b5b432b |
| Default Java | 1.8.0_242 |
| Test Results | https://builds.apache.org/job/PreCommit-TEZ-Build/336/testReport/ |
| Max. process+thread count | 218 (vs. ulimit of 5500) |
| modules | C: tez-dag U: tez-dag |
| Console output | https://builds.apache.org/job/PreCommit-TEZ-Build/336/console |
| versions | git=2.7.4 maven=3.3.9 findbugs=3.0.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |


This message was automatically generated.



> TEZ Recovery: Discrepancy In Scheduling Vertices During Vertex Recovery
> -----------------------------------------------------------------------
>
>                 Key: TEZ-4140
>                 URL: https://issues.apache.org/jira/browse/TEZ-4140
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.8.2, 0.9.0, 0.8.4, 0.9.1, 0.9.2
>            Reporter: Syed Shameerur Rahman
>            Assignee: Syed Shameerur Rahman
>            Priority: Major
>             Fix For: 0.10.0, 0.9.3
>
>         Attachments: DAG.png, TEZ-4140.01.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Issue*:
> During vertex recovery, the initialization stage of a vertex is skipped if
> 1) VertexInputInitializerEvent
> 2) VertexReconfigureDoneEvent
> are seen in the recovery data. The initialization stage is skipped by
> replacing the vertex's VertexManagerPlugin (e.g. ShuffleVertexManager,
> CustomVertexManager) with NoOpVertexManager. There are a couple of issues
> with replacing the VertexManagerPlugin with NoOpVertexManager:
> 1) A VertexManagerPlugin has completed its work only after the tasks of
> that vertex have been launched. Using NoOpVertexManager without checking
> whether tasks for that vertex were launched in the previous run can
> therefore cause discrepancies in deciding when, and how many, tasks should
> be launched in that vertex during recovery.
> 2) Maintaining vertex dependency:
> Suppose there are two vertices v1 and v2, where v2 depends on v1 (v1
> ---> v2). If for some reason v1 was not able to skip the initialization
> stage but v2 was, there is a chance that v2 gets scheduled before v1,
> because NoOpVertexManager is used.
> This is the problem I ran into. A DAG is attached for reference:
> In the DAG, Reducer 7 depends on Reducer 6. For some reason, during Tez
> recovery, Reducer 6's initialization stage was not skipped whereas
> Reducer 7's was. NoOpVertexManager was therefore used for Reducer 7
> instead of ShuffleVertexManager, and it launched all of Reducer 7's tasks
> without waiting for Reducer 6 to complete. Initially, Reducer 6 was
> expected to launch 14 tasks, so the tasks launched in Reducer 7 waited for
> 14 shuffle inputs. Later, auto-reduce parallelism adjusted the number of
> tasks in Reducer 6 to 1. Reducer 7's tasks were not informed of this
> change and kept waiting for 14 shuffle inputs when there was actually only
> 1, so the query got stuck. This can also lead to a deadlock when the
> number of containers is limited and Reducer 7 ends up using all of them.
> *Proposed Solution:*
> In addition to the VertexInputInitializerEvent and
> VertexReconfigureDoneEvent conditions, introduce two more conditions:
> 1) Check whether tasks were launched in the vertex in the previous run
> before replacing its VertexManagerPlugin with NoOpVertexManager.
> 2) All parent vertices must have skipped the initialization stage before
> a child vertex does. This is required to maintain vertex dependency.
> CC [~jeagles] [~bikassaha] [~jlowe] [~zjffdu]
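
The two extra conditions in the proposed solution can be illustrated with a small, self-contained sketch. This is not the code from TEZ-4140.01.patch and does not use any Tez API: the class RecoverySkipSketch, the VertexRecoveryInfo fields, and both check methods are hypothetical names made up for illustration. It also assumes that both recovery events must be present for the existing check to pass, and that Reducer 7 had launched tasks in the previous run; the issue text states neither explicitly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RecoverySkipSketch {

    /** Hypothetical summary of what the recovery data says about one vertex. */
    public static final class VertexRecoveryInfo {
        final boolean sawInputInitializerEvent;
        final boolean sawReconfigureDoneEvent;
        final boolean tasksLaunchedInPreviousRun;
        final List<String> parents;

        public VertexRecoveryInfo(boolean sawInputInitializerEvent,
                                  boolean sawReconfigureDoneEvent,
                                  boolean tasksLaunchedInPreviousRun,
                                  List<String> parents) {
            this.sawInputInitializerEvent = sawInputInitializerEvent;
            this.sawReconfigureDoneEvent = sawReconfigureDoneEvent;
            this.tasksLaunchedInPreviousRun = tasksLaunchedInPreviousRun;
            this.parents = parents;
        }
    }

    /** Event-only check as described in the issue (assumes both events are required). */
    public static boolean eventOnlyCheck(VertexRecoveryInfo info) {
        return info.sawInputInitializerEvent && info.sawReconfigureDoneEvent;
    }

    /**
     * Event check plus the two proposed conditions: tasks were launched in the
     * previous run, and every parent vertex can itself skip initialization.
     * Assumes the DAG is acyclic, so the recursion over parents terminates.
     */
    public static boolean canSkipInitialization(String vertex,
                                                Map<String, VertexRecoveryInfo> dag,
                                                Map<String, Boolean> memo) {
        Boolean cached = memo.get(vertex);
        if (cached != null) {
            return cached;
        }
        VertexRecoveryInfo info = dag.get(vertex);
        boolean ok = eventOnlyCheck(info) && info.tasksLaunchedInPreviousRun;
        if (ok) {
            for (String parent : info.parents) {
                if (!canSkipInitialization(parent, dag, memo)) {
                    ok = false;
                    break;
                }
            }
        }
        memo.put(vertex, ok);
        return ok;
    }

    /** Two-vertex example loosely modelled on the report; flag values are assumptions. */
    public static Map<String, VertexRecoveryInfo> exampleDag() {
        Map<String, VertexRecoveryInfo> dag = new HashMap<>();
        // Reducer 6 could not skip initialization (reconfigure-done event missing).
        dag.put("Reducer 6", new VertexRecoveryInfo(
                true, false, false, Collections.<String>emptyList()));
        // Reducer 7 satisfies every per-vertex condition but depends on Reducer 6.
        dag.put("Reducer 7", new VertexRecoveryInfo(
                true, true, true, Arrays.asList("Reducer 6")));
        return dag;
    }

    public static void main(String[] args) {
        Map<String, VertexRecoveryInfo> dag = exampleDag();
        System.out.println("Reducer 7, event-only check: "
                + eventOnlyCheck(dag.get("Reducer 7")));
        System.out.println("Reducer 7, with extra conditions: "
                + canSkipInitialization("Reducer 7", dag, new HashMap<String, Boolean>()));
    }
}
```

In this example the event-only check lets Reducer 7 skip initialization, while the extended check does not, because its parent Reducer 6 still has to go through initialization.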



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
