[
https://issues.apache.org/jira/browse/HDDS-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096722#comment-17096722
]
Stephen O'Donnell commented on HDDS-3504:
-----------------------------------------
Looking at a few of the failures, it seems it's the original freon step that is
failing:
{code}
2020/04/17/789:
<msg timestamp="20200417 18:01:32.551" level="INFO">Running command 'ozone
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads
1 --replicationType RATIS --factor THREE --validateWrites 2>&1'.</msg>
<msg timestamp="20200417 18:06:32.549" level="FAIL">Test timeout 5 minutes
exceeded.</msg>
2020/04/17/791:
<msg timestamp="20200417 18:50:52.139" level="INFO">Running command 'ozone
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads
1 --replicationType RATIS --factor THREE --validateWrites 2>&1'.</msg>
<msg timestamp="20200417 18:55:52.135" level="FAIL">Test timeout 5 minutes
exceeded.</msg>
<status status="FAIL" starttime="20200417 18:50:52.135" endtime="20200417
18:55:52.135"></status>
2020/04/20/792:
<msg timestamp="20200420 12:11:47.412" level="INFO">Running command 'ozone
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads
1 --replicationType RATIS --factor THREE --validateWrites 2>&1'.</msg>
<msg timestamp="20200420 12:16:47.410" level="FAIL">Test timeout 5 minutes
exceeded.</msg>
<status status="FAIL" starttime="20200420 12:11:47.410" endtime="20200420
12:16:47.410"></status>
{code}
Focusing on the last one above, safemode certainly exited:
{code}
scm_1 | 2020-04-20 12:11:33,974
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
safemode.SCMSafeModeManager: SCM in safe mode. 4 DataNodes registered, 4
required.
scm_1 | 2020-04-20 12:11:33,975
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
safemode.SCMSafeModeManager: DataNodeSafeModeRule rule is successfully validated
scm_1 | 2020-04-20 12:11:33,976
[EventQueue-NodeRegistrationContainerReportForDataNodeSafeModeRule] INFO
safemode.SCMSafeModeManager: All SCM safe mode pre check rules have passed
scm_1 | 2020-04-20 12:11:34,235
[EventQueue-NodeRegistrationContainerReportForContainerSafeModeRule] INFO
safemode.SCMSafeModeManager: ContainerSafeModeRule rule is successfully
validated
...
scm_1 | 2020-04-20 12:11:41,680
[EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO
safemode.SCMSafeModeManager: SCM in safe mode. Healthy pipelines reported count
is 1, required healthy pipeline reported count is 1
...
scm_1 | 2020-04-20 12:11:41,682
[EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO
safemode.SCMSafeModeManager: SCM exiting safe mode.
{code}
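As a quick sanity check on the ordering, here is a minimal Python sketch that compares the SCM safe mode exit time with the freon start and timeout times quoted above for 2020/04/20/792. It assumes the SCM log and the robot output use the same clock and time zone, which I have not verified. The helper names are mine.

```python
from datetime import datetime

def parse_scm(ts: str) -> datetime:
    # SCM log format, e.g. "2020-04-20 12:11:41,682"
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")

def parse_robot(ts: str) -> datetime:
    # Robot Framework format, e.g. "20200420 12:11:47.412"
    return datetime.strptime(ts, "%Y%m%d %H:%M:%S.%f")

# Timestamps copied from the excerpts for run 2020/04/20/792.
safe_mode_exit = parse_scm("2020-04-20 12:11:41,682")
freon_start = parse_robot("20200420 12:11:47.412")
freon_timeout = parse_robot("20200420 12:16:47.410")

# Assumption: both logs share one clock, so the values can be subtracted.
gap_before_freon = (freon_start - safe_mode_exit).total_seconds()
freon_runtime = (freon_timeout - freon_start).total_seconds()

print(f"safe mode exited {gap_before_freon:.3f} s before freon started")
print(f"freon ran {freon_runtime:.3f} s before the timeout fired")
```

Under that assumption, freon started several seconds after SCM left safe mode and then ran for the whole 5 minute timeout.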
However, the role logs are filled with disk-space-related errors, e.g.:
{code}
Caused by: org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: Out of
space: The volume with the most available space (=965574656 B) is less than the
container size (=1073741824 B).
datanode_5_1 | at
org.apache.hadoop.ozone.container.common.volume.RoundRobinVolumeChoosingPolicy.chooseVolume(RoundRobinVolumeChoosingPolicy.java:77)
datanode_5_1 | at
org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:124)
datanode_5_1 | ... 13 more
{code}
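To put the numbers from that exception in perspective, here is a small Python sketch of the comparison the message describes. It only reproduces the arithmetic and is not the actual RoundRobinVolumeChoosingPolicy logic. The function name is hypothetical.

```python
MIB = 1024 ** 2

def space_shortfall(most_available: int, container_size: int) -> int:
    """Bytes missing on the emptiest volume; 0 if the container would fit."""
    return max(0, container_size - most_available)

# Values copied from the DiskOutOfSpaceException message above.
most_available = 965574656   # volume with the most available space, in bytes
container_size = 1073741824  # container size, in bytes (1 GiB)

missing = space_shortfall(most_available, container_size)
print(f"available: {most_available / MIB:.1f} MiB")
print(f"required:  {container_size / MIB:.1f} MiB")
print(f"shortfall: {missing / MIB:.1f} MiB")
```

So the best volume has roughly 921 MiB free and is about 103 MiB short of the 1 GiB container size.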
This message appears in various DN logs. I will attach the full log here
for reference.
The failures on 2020/04/17/791 and 2020/04/17/789 also have the space errors.
Run 2020/04/27/830 also failed at the freon step, but there are no disk
space issues:
{code}
<msg timestamp="20200427 22:00:52.826" level="INFO">Running command 'ozone
freon randomkeys --numOfVolumes 5 --numOfBuckets 5 --numOfKeys 5 --numOfThreads
1 --replicationType RATIS --factor THREE --validateWrites 2>&1'.</msg>
<msg timestamp="20200427 22:05:52.823" level="FAIL">Test timeout 5 minutes
exceeded.</msg>
<status status="FAIL" starttime="20200427 22:00:52.824" endtime="20200427
22:05:52.823"></status>
{code}
I will attach the log for it too, so we have it tracked here. I am not sure why
it failed.
Based on these 3 occurrences, it seems a lot of the instability is caused by
space issues rather than by the changes made to the test in HDDS-3084 /
HDDS-3135, as that code is not even getting executed (the test fails before
it reaches that step).
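For anyone wanting to repeat this triage on other runs, a rough Python sketch of the check I did by hand: look for the DiskOutOfSpaceException class name in each run's combined log. The function name and the sample dictionary are illustrative only. The log text for 830 is an empty stand-in for "no such exception found", not real log content.

```python
SPACE_ERROR = "DiskChecker$DiskOutOfSpaceException"

def classify_runs(logs: dict) -> dict:
    """Map each run id to 'space' if its log mentions the exception, else 'unknown'."""
    return {
        run: "space" if SPACE_ERROR in text else "unknown"
        for run, text in logs.items()
    }

# Illustrative input: the first entry uses the start of the exception line
# quoted above; the second is an empty placeholder (assumption).
sample_logs = {
    "2020/04/20/792": "Caused by: org.apache.hadoop.util."
                      "DiskChecker$DiskOutOfSpaceException: Out of space",
    "2020/04/27/830": "",
}

result = classify_runs(sample_logs)
for run, cause in result.items():
    print(run, cause)
```

A run marked "unknown" here, like 830, still needs a manual look at its logs.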
> Topology related acceptance test is intermittent
> ------------------------------------------------
>
> Key: HDDS-3504
> URL: https://issues.apache.org/jira/browse/HDDS-3504
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: Marton Elek
> Priority: Critical
>
> It has failed multiple times on master and PRs. See the archive of the build
> results at: https://github.com/elek/ozone-build-results
> I am disabling it to avoid flaky runs, but we need to fix and re-enable it.
> {code}
> ./find-first.sh "Test timeout 5 minutes exceeded."
> --include="robot-ozone-topology-ozone-topology-basic-scm.xml"
> 2020/04/17/789/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
> timestamp="20200417 18:06:32.549" level="FAIL">Test timeout 5 minutes
> exceeded.</msg>
> 2020/04/17/789/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
> status="FAIL" starttime="20200417 18:01:32.547" endtime="20200417
> 18:06:32.550" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/17/791/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
> timestamp="20200417 18:55:52.135" level="FAIL">Test timeout 5 minutes
> exceeded.</msg>
> 2020/04/17/791/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
> status="FAIL" starttime="20200417 18:50:52.134" endtime="20200417
> 18:55:52.135" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/20/792/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
> timestamp="20200420 12:16:47.410" level="FAIL">Test timeout 5 minutes
> exceeded.</msg>
> 2020/04/20/792/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
> status="FAIL" starttime="20200420 12:11:47.409" endtime="20200420
> 12:16:47.411" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/21/803/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
> timestamp="20200422 00:10:32.971" level="FAIL">Test timeout 5 minutes
> exceeded.</msg>
> 2020/04/21/803/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
> status="FAIL" starttime="20200422 00:05:32.970" endtime="20200422
> 00:10:32.972" critical="yes">Test timeout 5 minutes exceeded.</status>
> 2020/04/27/830/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<msg
> timestamp="20200427 22:05:52.823" level="FAIL">Test timeout 5 minutes
> exceeded.</msg>
> 2020/04/27/830/acceptance/robot-ozone-topology-ozone-topology-basic-scm.xml:<status
> status="FAIL" starttime="20200427 22:00:52.822" endtime="20200427
> 22:05:52.824" critical="yes">Test timeout 5 minutes exceeded.</status>
> First failed commit: 3bb5838196536f2ea4ac1ab4dcd0bc53ae97f7e0
> {code}