[
https://issues.apache.org/jira/browse/TEZ-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103290#comment-17103290
]
László Bodor edited comment on TEZ-4173 at 5/10/20, 7:05 AM:
-------------------------------------------------------------
It seems to be broken by TEZ-4140, or at least I ran the test successfully
twice in a row with TEZ-4140 reverted, and then it failed on the first run
with the patch applied.
cc: [~srahman], in case you have any pointers about this
{code:java}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
According to the logs, it gets stuck here:
{code:java}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
If I remove [the first condition|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832]
introduced by the patch, the test succeeds:
{code:java}
recoveryData.isVertexTasksStarted()
{code}
The test picks events at random and shuts the AM down after the picked
events. I found that the issue only occurs in the case of the first vertex's
VertexConfigurationDoneEvent together with enableAutoParallelism.
When it tries to recover the first vertex, recoveryData.isVertexTasksStarted()
returns false, because recoveryData is present (not null) but
recoveryData.taskRecoveryDataMap is empty, so it doesn't hit the skipping
code path. So, if I understand correctly, this is an edge case that should be
handled: the "vertex configure done" event has already been seen, but no
tasks have been stored in the recovery data yet.
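To make the edge case concrete, here is a minimal, self-contained sketch. It is not the Tez code: VertexRecoveryData below is a hypothetical stand-in, the configurationDoneSeen flag and the RecoveryState enum are invented for illustration, and the body of isVertexTasksStarted() is an assumption based only on the behaviour described above (it returns false while taskRecoveryDataMap is empty).

```java
import java.util.HashMap;
import java.util.Map;

public class RecoveryEdgeCaseSketch {

    /** Hypothetical stand-in for per-vertex recovery data; NOT the real Tez class. */
    static final class VertexRecoveryData {
        // Hypothetical flag: the "vertex configure done" event was recovered.
        final boolean configurationDoneSeen;
        // Task id -> recovered task info (a plain String placeholder here).
        final Map<Integer, String> taskRecoveryDataMap = new HashMap<>();

        VertexRecoveryData(boolean configurationDoneSeen) {
            this.configurationDoneSeen = configurationDoneSeen;
        }

        // Assumed behaviour: tasks count as started once at least one is recorded.
        boolean isVertexTasksStarted() {
            return !taskRecoveryDataMap.isEmpty();
        }
    }

    /** Hypothetical classification of what the recovery data tells us. */
    enum RecoveryState {
        NO_RECOVERY_DATA, NOT_CONFIGURED, CONFIGURED_WITHOUT_TASKS, TASKS_STARTED
    }

    static RecoveryState classify(VertexRecoveryData data) {
        if (data == null) {
            return RecoveryState.NO_RECOVERY_DATA;
        }
        if (data.isVertexTasksStarted()) {
            return RecoveryState.TASKS_STARTED;
        }
        // Recovery data exists but holds no tasks: the case described above
        // when the configure-done event has already been seen.
        return data.configurationDoneSeen
                ? RecoveryState.CONFIGURED_WITHOUT_TASKS
                : RecoveryState.NOT_CONFIGURED;
    }

    public static void main(String[] args) {
        VertexRecoveryData edgeCase = new VertexRecoveryData(true);
        System.out.println(edgeCase.isVertexTasksStarted()); // false
        System.out.println(classify(edgeCase));              // CONFIGURED_WITHOUT_TASKS

        edgeCase.taskRecoveryDataMap.put(0, "task_0");
        System.out.println(classify(edgeCase));              // TASKS_STARTED
    }
}
```

The point of the sketch is only that "recovery data present, configure-done seen, no tasks stored" is distinguishable from both "no recovery data" and "tasks started"; how the real recovery code should react to it is still open.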
I uploaded [^TEZ-4173.01.patch], but I don't think it's correct; it just makes
the unit test pass, so this needs further investigation.
> TestRecovery flaky timeout on master
> ------------------------------------
>
> Key: TEZ-4173
> URL: https://issues.apache.org/jira/browse/TEZ-4173
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: TEZ-4173.01.patch, am.jstack.log, surefire.jstack.log,
> tez4173.tar.gz
>
>
> application logs and junit output in [^tez4173.tar.gz]
> one of the running AM's jstack is [^am.jstack.log]