[ 
https://issues.apache.org/jira/browse/TEZ-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103290#comment-17103290
 ] 

László Bodor edited comment on TEZ-4173 at 5/9/20, 4:59 PM:
------------------------------------------------------------

This seems to be broken by TEZ-4140, or at least the evidence points that way: 
with TEZ-4140 reverted the test passed twice in a row, and with the patch 
applied again it failed on the first run.
cc: [~srahman], in case you have any pointers on this.
{code:java}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
According to the logs, it gets stuck here:
{code:java}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
If I remove [both of the conditions|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832] 
introduced by the patch, it succeeds:
{code:java}
recoveryData.isVertexTasksStarted() && isVertexInitSkippedInParentVertices()
{code}
The test picks events at random and shuts the AM down after the picked 
events. I found that the issue only occurs for the first vertex's 
VertexConfigurationDoneEvent combined with enableAutoParallelism.
When it tries to recover the first vertex, recoveryData.isVertexTasksStarted() 
returns false because recoveryData.taskRecoveryDataMap is empty, so it 
doesn't hit the skipping codepath. If I understand correctly, this is an edge 
case that should be handled: the "vertex configure done" event has already 
been seen, but no tasks have been stored in the recovery data yet.
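
To make the edge case concrete, here is a minimal sketch of how the quoted condition evaluates. It is a simplified model, not the actual Tez code: the class names, the configurationDoneSeen field, and the boolean parameter standing in for isVertexInitSkippedInParentVertices() are all assumptions made for illustration. Only the described behaviour (isVertexTasksStarted() is false while taskRecoveryDataMap is empty) comes from the observation above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of the recovery check; not the actual Tez classes. */
public class RecoveryEdgeCaseSketch {

    /** Stand-in for the vertex recovery data described in the comment. */
    static class VertexRecoveryDataModel {
        // Assumption: true once "vertex configure done" was read from the recovery log.
        final boolean configurationDoneSeen;
        // Assumption: task id -> recovered state; empty when no task event was logged.
        final Map<String, String> taskRecoveryDataMap = new HashMap<>();

        VertexRecoveryDataModel(boolean configurationDoneSeen) {
            this.configurationDoneSeen = configurationDoneSeen;
        }

        // Mirrors the observed behaviour: false while the task map is empty.
        boolean isVertexTasksStarted() {
            return !taskRecoveryDataMap.isEmpty();
        }
    }

    /** The combined condition, with the parent-vertex check passed in as a plain flag. */
    static boolean takesSkippingCodepath(VertexRecoveryDataModel data,
                                         boolean initSkippedInParents) {
        return data.isVertexTasksStarted() && initSkippedInParents;
    }

    public static void main(String[] args) {
        // Edge case: configure-done was seen, but the AM stopped before any task was logged.
        VertexRecoveryDataModel edgeCase = new VertexRecoveryDataModel(true);
        System.out.println(takesSkippingCodepath(edgeCase, true)); // prints "false"

        // Once at least one task is present in the recovery data, the condition holds.
        edgeCase.taskRecoveryDataMap.put("task_0", "RUNNING");
        System.out.println(takesSkippingCodepath(edgeCase, true)); // prints "true"
    }
}
```

In this model the first call returns false even though configurationDoneSeen is true, which matches the state the test gets stuck in: the configure-done event is not consulted by the condition at all.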


> TestRecovery flaky timeout on master
> ------------------------------------
>
>                 Key: TEZ-4173
>                 URL: https://issues.apache.org/jira/browse/TEZ-4173
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: am.jstack.log, surefire.jstack.log, tez4173.tar.gz
>
>
> application logs and junit output in  [^tez4173.tar.gz] 
> one of the running AM's jstack is  [^am.jstack.log] 


