[ 
https://issues.apache.org/jira/browse/TEZ-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033585#comment-17033585
 ] 

László Bodor edited comment on TEZ-4123 at 2/10/20 2:03 PM:
------------------------------------------------------------

I cannot see any fix in TEZ-3664 for the flakiness of this test class.

UPDATE: it turned out that the YARN disk health check disabled the node. From the logs:
{code}
2020-02-10 14:36:15,160 INFO  [RM Event dispatcher] rmnode.RMNodeImpl 
(RMNodeImpl.java:transition(1209)) - Node abstractdog-440s:42154 reported 
UNHEALTHY with details: 1/1 local-dirs usable space is below configured 
utilization percentage/no more usable space [ 
/home/abstractdog/apache/tez/tez-tests/target/org.apache.tez.mapreduce.TestMRRJobsDAGApi/org.apache.tez.mapreduce.TestMRRJobsDAGApi-localDir-nm-0_0
 : used space above threshold of 90.0% ] ; 1/1 log-dirs usable space is below 
configured utilization percentage/no more usable space [ 
/home/abstractdog/apache/tez/tez-tests/target/org.apache.tez.mapreduce.TestMRRJobsDAGApi/org.apache.tez.mapreduce.TestMRRJobsDAGApi-logDir-nm-0_0
 : used space above threshold of 90.0% ] 
{code}

It seems that raising the utilization threshold solved the issue (my disk is at 93%):
{code}
conf.setInt("yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage", 99);
{code}
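
To illustrate why a 93% full disk fails at a threshold of 90 but passes at 99, here is a minimal plain-Java sketch. It assumes the health check boils down to a simple "used percentage above threshold" comparison, which is what the log message suggests. This is not YARN's actual code, and the class and method names (DiskUtilizationSketch, usedPercent, isUsable) are hypothetical.

```java
import java.io.File;

// Hypothetical sketch, NOT YARN's implementation: it only mirrors the
// "used space above threshold of N%" comparison seen in the log above.
class DiskUtilizationSketch {

    // Percentage (0..100) of the partition holding 'dir' that is in use.
    static float usedPercent(File dir) {
        long total = dir.getTotalSpace();
        if (total <= 0) {
            // Size unknown (e.g. the path does not exist): treat as full.
            return 100.0f;
        }
        long usable = dir.getUsableSpace();
        return (total - usable) * 100.0f / total;
    }

    // A directory counts as usable while its utilization does not exceed
    // the configured maximum (assumed semantics of the threshold).
    static boolean isUsable(float usedPercent, float maxUtilizationPercent) {
        return usedPercent <= maxUtilizationPercent;
    }

    public static void main(String[] args) {
        float used = 93.0f; // the disk utilization reported in this comment
        System.out.println("threshold 90: usable=" + isUsable(used, 90.0f));
        System.out.println("threshold 99: usable=" + isUsable(used, 99.0f));
        System.out.println("current dir: " + usedPercent(new File(".")) + "% used");
    }
}
```

With the numbers from this comment, the first check reports the directory as not usable and the second as usable, which matches what I saw before and after changing the property.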

[~jeagles]: what do you think about setting this property to 99% wherever 
MiniTezCluster is used in tests? I think the current 90% threshold can 
introduce random flakiness: the environment itself is healthy (and suitable 
for running simple YARN-based Tez unit tests), but YARN silently disables the 
test node. We used to see similar issues in Hive preCommit tests as well.



was (Author: abstractdog):
I cannot see that TEZ-3664 contains any fix regarding this test class flakiness

> TestMRRJobsDAGApi flaky timeout
> -------------------------------
>
>                 Key: TEZ-4123
>                 URL: https://issues.apache.org/jira/browse/TEZ-4123
>             Project: Apache Tez
>          Issue Type: Test
>            Reporter: László Bodor
>            Priority: Major
>         Attachments: TestMRRJobsDAGApi.out, 
> org.apache.tez.mapreduce.TestMRRJobsDAGApi-output.txt
>
>
> Failed in both precommit and on master locally:
> {code}
> mvn clean install -pl ./tez-tests -Dtest=TestMRRJobsDAGApi
> {code}
> surefire process thread dump:  [^TestMRRJobsDAGApi.out] 
> test output:  [^org.apache.tez.mapreduce.TestMRRJobsDAGApi-output.txt] 



