[
https://issues.apache.org/jira/browse/TEZ-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103290#comment-17103290
]
László Bodor edited comment on TEZ-4173 at 5/9/20, 4:59 PM:
------------------------------------------------------------
seems like it's broken by TEZ-4140, or at least the test passed twice in a row
with TEZ-4140 reverted, and then failed on the first run with the patch
applied
cc: [~srahman], in case you have any pointers about this
{code:java}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
it gets stuck here according to the logs:
{code:java}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
if I remove [both of the
conditions|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832]
introduced by the patch, it succeeds:
{code:java}
recoveryData.isVertexTasksStarted() && isVertexInitSkippedInParentVertices()
{code}
the test randomly picks events and shuts the AM down after the picked events,
and I found that the issue only occurs with the first vertex's
VertexConfigurationDoneEvent + enableAutoParallelism
when it tries to recover the first vertex, recoveryData.isVertexTasksStarted()
returns false because recoveryData.taskRecoveryDataMap is empty, so it doesn't
hit the skipping codepath. So, if I understand correctly, this is an edge case
that should be handled: the vertex configure event has already been seen, but
the tasks have not been stored in the recovery data
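to make the edge case concrete, here is a minimal sketch of the condition. It is
not the Tez code: RecoverySkipSketch, VertexRecoveryDataModel and takesSkipPath
are hypothetical names, and it assumes (based on the observation above) that
isVertexTasksStarted() boils down to taskRecoveryDataMap being non-empty.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the recovery check; not the actual Tez classes.
class RecoverySkipSketch {

    // Stand-in for the per-vertex recovery data.
    static class VertexRecoveryDataModel {
        // True once the vertex's VertexConfigurationDoneEvent has been recovered.
        final boolean configurationDoneSeen;
        // Task index -> recovered task data (a plain String here for illustration).
        final Map<Integer, String> taskRecoveryDataMap = new HashMap<>();

        VertexRecoveryDataModel(boolean configurationDoneSeen) {
            this.configurationDoneSeen = configurationDoneSeen;
        }

        // Assumption: "tasks started" means at least one task was recorded.
        boolean isVertexTasksStarted() {
            return !taskRecoveryDataMap.isEmpty();
        }
    }

    // Mirrors the shape of the condition quoted above.
    static boolean takesSkipPath(VertexRecoveryDataModel data, boolean initSkippedInParents) {
        return data.isVertexTasksStarted() && initSkippedInParents;
    }

    public static void main(String[] args) {
        // AM was shut down right after the configure event: no tasks recorded yet.
        VertexRecoveryDataModel edgeCase = new VertexRecoveryDataModel(true);
        System.out.println("configured, no tasks: " + takesSkipPath(edgeCase, true));

        // Same vertex after at least one task made it into the recovery data.
        edgeCase.taskRecoveryDataMap.put(0, "task_0");
        System.out.println("configured, one task: " + takesSkipPath(edgeCase, true));
    }
}
```

in this model the first case does not take the skip path even though the
configure event was seen, which matches the state the test hangs in; whether
the real fix should key off the configure event instead is a separate question.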
was (Author: abstractdog):
seems like it's broken by TEZ-4140, or at least the test passed twice in a row
with TEZ-4140 reverted, and then failed on the first run with the patch
applied
cc: [~srahman], in case you have any pointers about this
{code}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
it gets stuck here according to the logs:
{code}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
if I remove [any of the
conditions|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832]
introduced by the patch, it succeeds:
{code}
recoveryData.isVertexTasksStarted() && isVertexInitSkippedInParentVertices()
{code}
the test randomly picks events and shuts the AM down after the picked events,
and I found that the issue only occurs with the first vertex's
VertexConfigurationDoneEvent + enableAutoParallelism
> TestRecovery flaky timeout on master
> ------------------------------------
>
> Key: TEZ-4173
> URL: https://issues.apache.org/jira/browse/TEZ-4173
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: am.jstack.log, surefire.jstack.log, tez4173.tar.gz
>
>
> application logs and junit output in [^tez4173.tar.gz]
> one of the running AM's jstack is [^am.jstack.log]