[jira] [Commented] (HDDS-3504) Topology related acceptance test is intermittent

Stephen O'Donnell (Jira) Thu, 30 Apr 2020 09:36:25 -0700


    [ 
https://issues.apache.org/jira/browse/HDDS-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096722#comment-17096722
 ]


Stephen O'Donnell commented on HDDS-3504:
-----------------------------------------

Looking at a few of the failures, it seems its the original freon step that is 
failing:

{code}
2020/04/17/789:

<msg timestamp="20200417 18:01:32.551" level="INFO">Running command 'ozone 
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads 
1 --replicationType RATIS --factor THREE --validateWrites 2&gt;&amp;1'.</msg>
<msg timestamp="20200417 18:06:32.549" level="FAIL">Test timeout 5 minutes 
exceeded.</msg>

2020/04/17/791:

<msg timestamp="20200417 18:50:52.139" level="INFO">Running command 'ozone 
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads 
1 --replicationType RATIS --factor THREE --validateWrites 2&gt;&amp;1'.</msg>
<msg timestamp="20200417 18:55:52.135" level="FAIL">Test timeout 5 minutes 
exceeded.</msg>
<status status="FAIL" starttime="20200417 18:50:52.135" endtime="20200417 
18:55:52.135"></status>


2020/04/20/792:

<msg timestamp="20200420 12:11:47.412" level="INFO">Running command 'ozone 
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads 
1 --replicationType RATIS --factor THREE --validateWrites 2&gt;&amp;1'.</msg>
<msg timestamp="20200420 12:16:47.410" level="FAIL">Test timeout 5 minutes 
exceeded.</msg>
<status status="FAIL" starttime="20200420 12:11:47.410" endtime="20200420 
12:16:47.410"></status>

{code}

Focusing on the last one above, safemode certainly exited:

{code}
scm_1         | 2020-04-20 12:11:33,974 
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
safemode.SCMSafeModeManager: SCM in safe mode. 4 DataNodes registered, 4 
required.
scm_1         | 2020-04-20 12:11:33,975 
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
scm_1         | 2020-04-20 12:11:33,976 
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO 
safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed
scm_1         | 2020-04-20 12:11:34,235 
[EventQueue-NodeRegistrationContainerReportForContainerSafeModeRule] INFO 
safemode.SCMSafeModeManager: ContainerSafeModeRule rule is successfully 
validated
...
scm_1         | 2020-04-20 12:11:41,680 
[EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO 
safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count 
is 1, required healthy pipeline reported count is 1
...
scm_1         | 2020-04-20 12:11:41,682 
[EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO 
safemode.SCMSafeModeManager: SCM exiting safe mode.
{code}

However the role logs are filled with disk space related errors, eg:

{code}
Caused by: org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of 
space: The volume with the most available space (=965574656 B) is less than the 
container size (=1073741824 B).
datanode_5_1  |         at 
org.apache.hadoop.ozone.container.common.volume.RoundRobinVolumeChoosingPolicy.chooseVolume(RoundRobinVolumeChoosingPolicy.java:77)
datanode_5_1  |         at 
org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:124)
datanode_5_1  |         ... 13 more
{code}

This message is appeared on various DN logs. I will attach the full log here 
for reference.

The failure on 2020/04/17/791 and 2020/04/17/789 also have the space errors.

This run 2020/04/27/830 also failed at the freon step, but there are no disk 
space issues:

{code}
<msg timestamp="20200427 22:00:52.826" level="INFO">Running command 'ozone 
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads 
1 --replicationType RATIS --factor THREE --validateWrites 2&gt;&amp;1'.</msg>
<msg timestamp="20200427 22:05:52.823" level="FAIL">Test timeout 5 minutes 
exceeded.</msg>
<status status="FAIL" starttime="20200427 22:00:52.824" endtime="20200427 
22:05:52.823"></status>
{code}

I will attach the log for it too, so we have it tracked here. I am not sure why 
it failed.

Based on these 3 occurrences, it seems a lot of the instability is caused by 
space issues rather than the changes made to the test in HDDS-3084 / HDDS-3135 
as that code is not even getting executed (test fails before it reaches that 
step).

> Topology related acceptance test is intermittent
> ------------------------------------------------
>
>                 Key: HDDS-3504
>                 URL: https://issues.apache.org/jira/browse/HDDS-3504
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Priority: Critical
>
> It's failed multiple times on master and PRs. See the archive of the builds 
> results at: https://github.com/elek/ozone-build-results 
> I am disabling it to avoid flaky runs, but we need to fix and re-enable it.
> {code}
> ./find-first.sh "Test timeout 5 minutes exceeded." 
> --include="robot-ozone-topology-ozone-topology-basic-scm.xml"
> 2020/04/17/789/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
>  timestamp="20200417 18:06:32.549" level="FAIL">Test timeout 5 minutes 
> exceeded.</msg>
> 2020/04/17/789/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
>  status="FAIL" starttime="20200417 18:01:32.547" endtime="20200417 
> 18:06:32.550" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/17/791/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
>  timestamp="20200417 18:55:52.135" level="FAIL">Test timeout 5 minutes 
> exceeded.</msg>
> 2020/04/17/791/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
>  status="FAIL" starttime="20200417 18:50:52.134" endtime="20200417 
> 18:55:52.135" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/20/792/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
>  timestamp="20200420 12:16:47.410" level="FAIL">Test timeout 5 minutes 
> exceeded.</msg>
> 2020/04/20/792/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
>  status="FAIL" starttime="20200420 12:11:47.409" endtime="20200420 
> 12:16:47.411" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/21/803/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
>  timestamp="20200422 00:10:32.971" level="FAIL">Test timeout 5 minutes 
> exceeded.</msg>
> 2020/04/21/803/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
>  status="FAIL" starttime="20200422 00:05:32.970" endtime="20200422 
> 00:10:32.972" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/27/830/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
>  timestamp="20200427 22:05:52.823" level="FAIL">Test timeout 5 minutes 
> exceeded.</msg>
> 2020/04/27/830/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
>  status="FAIL" starttime="20200427 22:00:52.822" endtime="20200427 
> 22:05:52.824" critical="yes">Test timeout 5 minutes exceeded.</status>
> First failed commit: 3bb5838196536f2ea4ac1ab4dcd0bc53ae97f7e0
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDDS-3504) Topology related acceptance test is intermittent

Reply via email to