[
https://issues.apache.org/jira/browse/TEZ-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103290#comment-17103290
]
László Bodor edited comment on TEZ-4173 at 5/9/20, 3:13 PM:
------------------------------------------------------------
This seems to be broken by TEZ-4140: with TEZ-4140 reverted, the test passed
twice in a row, and with the patch applied again it failed on the first run.
cc: [~srahman], in case you have any pointers on this
{code}
mvn test -Dtest=TestRecovery#testRecovery_OrderedWordCount -pl ./tez-tests -pl tez-dag
{code}
According to the logs, it gets stuck here:
{code}
DAG: State: RUNNING Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Summation Progress: 0% TotalTasks: 5 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
{code}
If I remove [either of the conditions|https://github.com/apache/tez/commit/e7c24f06e220cb707f114b4f5cc7210d27cce72d#diff-92831ef1b6960a063fa41c2d293823ffR2832]
introduced by the patch, the test succeeds:
{code}
recoveryData.isVertexTasksStarted() && isVertexInitSkippedInParentVertices()
{code}
The test picks events at random and shuts the AM down after each picked
event. I found that the issue only occurs with the combination
VertexConfigurationDoneEvent + enableAutoParallelism.
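
Because the shutdown points are sampled at random, the failure only shows up on some runs. One way to reproduce it more reliably might be to seed the sampling or to pin it to the suspicious event. Below is a minimal sketch of that idea, not the actual TestRecovery code: the class ShutdownPointPicker, its methods, and the placeholder event names other than VertexConfigurationDoneEvent are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper (not part of Tez): chooses the events after which the AM is shut down. */
public class ShutdownPointPicker {

    /** Samples up to 'count' distinct events; the same seed always yields the same picks. */
    static List<String> pickRandom(List<String> events, int count, long seed) {
        List<String> shuffled = new ArrayList<>(events);
        Collections.shuffle(shuffled, new Random(seed));
        int n = Math.min(count, shuffled.size());
        return new ArrayList<>(shuffled.subList(0, n));
    }

    /** Pins the shutdown points to events with the given name instead of sampling. */
    static List<String> pickOnly(List<String> events, String eventName) {
        List<String> picked = new ArrayList<>();
        for (String event : events) {
            if (event.equals(eventName)) {
                picked.add(event);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // Placeholder event names; only VertexConfigurationDoneEvent comes from the comment above.
        List<String> events = Arrays.asList(
            "PlaceholderEventA", "VertexConfigurationDoneEvent", "PlaceholderEventB");

        // Seeded sampling: rerunning with the same seed repeats the same shutdown points.
        System.out.println(pickRandom(events, 2, 42L));

        // Pinned: always shut down after the event that seems to trigger the hang.
        System.out.println(pickOnly(events, "VertexConfigurationDoneEvent"));
    }
}
```

With the pick pinned like this, every run would exercise the VertexConfigurationDoneEvent case, which should show quickly whether the hang is deterministic for that event when auto parallelism is enabled.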
> TestRecovery flaky timeout on master
> ------------------------------------
>
> Key: TEZ-4173
> URL: https://issues.apache.org/jira/browse/TEZ-4173
> Project: Apache Tez
> Issue Type: Bug
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Attachments: am.jstack.log, surefire.jstack.log, tez4173.tar.gz
>
>
> application logs and junit output in [^tez4173.tar.gz]
> one of the running AM's jstack is [^am.jstack.log]