[ 
https://issues.apache.org/jira/browse/TEZ-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103290#comment-17103290
 ] 

László Bodor edited comment on TEZ-4173 at 5/10/20, 7:11 AM:
-------------------------------------------------------------

It seems to be broken by TEZ-4140, or at least I've run the test successfully 
twice in a row with TEZ-4140 reverted, and then it failed on the first run 
with the patch applied.
 cc: [~srahman], if you have any pointers about this
{code:java}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
With the patch [^TEZ-4173.reproduction.patch], it always hits the problematic 
case.

According to the logs, it gets stuck here:
{code:java}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
        VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
If I remove [the first condition|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832] 
introduced by the patch, it succeeds:
{code:java}
recoveryData.isVertexTasksStarted()
{code}
The test picks events at random and shuts the AM down after the picked 
events. I found that the issue only occurs for the first vertex's 
VertexConfigurationDoneEvent combined with enableAutoParallelism.
 When it tries to recover the first vertex, recoveryData.isVertexTasksStarted() 
returns false, because recoveryData is present (not null) but 
recoveryData.taskRecoveryDataMap is empty, so it doesn't hit the skipping 
code path. If I understand correctly, this is an edge case that should be 
handled: the "vertex configure done" event has already been seen, but the 
tasks have not been stored in the recovery data.
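
To make the edge case above concrete, here is a minimal, self-contained sketch. It is not the Tez code: the class names, the configureDoneSeen field, and the helper methods are hypothetical, and isVertexTasksStarted() is assumed to simply report whether the task map is non-empty, which is only what the observed behaviour suggests.

```java
import java.util.HashMap;
import java.util.Map;

public class RecoveryEdgeCaseSketch {

    /** Hypothetical stand-in for the vertex recovery data; NOT the real Tez class. */
    static class VertexRecoveryDataSketch {
        // Assumption: true once the "vertex configure done" event was recorded.
        final boolean configureDoneSeen;
        // Assumption: task id -> recovered task info (a plain String here).
        final Map<Integer, String> taskRecoveryDataMap = new HashMap<>();

        VertexRecoveryDataSketch(boolean configureDoneSeen) {
            this.configureDoneSeen = configureDoneSeen;
        }

        // Assumed behaviour: tasks count as started only if at least one is recorded.
        boolean isVertexTasksStarted() {
            return !taskRecoveryDataMap.isEmpty();
        }
    }

    /** Models the guard described above: the skip needs recorded tasks. */
    static boolean takesSkippingCodepath(VertexRecoveryDataSketch recoveryData) {
        return recoveryData != null && recoveryData.isVertexTasksStarted();
    }

    /** The edge case: configure done was seen, but no task was stored yet. */
    static boolean isEdgeCase(VertexRecoveryDataSketch recoveryData) {
        return recoveryData != null
            && recoveryData.configureDoneSeen
            && !recoveryData.isVertexTasksStarted();
    }

    public static void main(String[] args) {
        // AM shut down right after the configure-done event: no tasks recorded.
        VertexRecoveryDataSketch recoveryData = new VertexRecoveryDataSketch(true);
        System.out.println("skipping code path taken: " + takesSkippingCodepath(recoveryData));
        System.out.println("edge case detected: " + isEdgeCase(recoveryData));
    }
}
```

Whether the real fix should treat this state like the started case or handle it separately is exactly the open question; the sketch only shows why a non-null recoveryData with an empty task map falls outside the guard.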

Uploaded [^TEZ-4173.01.patch], but I don't think it's correct: it just makes 
the unit test pass, so this needs further investigation.


> TestRecovery flaky timeout on master
> ------------------------------------
>
>                 Key: TEZ-4173
>                 URL: https://issues.apache.org/jira/browse/TEZ-4173
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4173.01.patch, TEZ-4173.reproduction.patch, 
> am.jstack.log, surefire.jstack.log, tez4173.tar.gz
>
>
> application logs and junit output in  [^tez4173.tar.gz] 
> one of the running AM's jstack is  [^am.jstack.log] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
