adoroszlai opened a new pull request, #5945:
URL: https://github.com/apache/ozone/pull/5945
## What changes were proposed in this pull request?
HDDS-8982 added a new assertion in `TestSafeMode` and set a timeout of 1
minute for the test case. A recent run hit the following problem:
```
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 103.8 s <<<
FAILURE! -- in org.apache.hadoop.fs.ozone.TestSafeMode
org.apache.hadoop.fs.ozone.TestSafeMode.o3fs -- Time elapsed: 72.90 s <<<
ERROR!
java.util.concurrent.TimeoutException: o3fs() timed out after 60 seconds
```
The initial `selectContainer` call correctly found no container:
```
2024-01-08 10:08:41,553 [main] WARN container.ContainerManagerImpl
(ContainerManagerImpl.java:getMatchingContainer(344)) - Container allocation
failed on pipeline=Pipeline[ Id: 30c296b9-71b9-4744-8977-9b77b35a0eb3, Nodes:
e6c09ac7-730a-4058-a1a8-e64ffc2fb789(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)d591e5af-e8f5-464c-8a55-75d5e8fc5b83(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)6df5c950-b1f8-41e9-a8bd-47cd1228c3d6(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19),
ReplicationConfig: RATIS/THREE, State:OPEN,
leaderId:e6c09ac7-730a-4058-a1a8-e64ffc2fb789,
CreationTimestamp2024-01-08T10:08:39.438Z[Etc/UTC]]
java.lang.IllegalArgumentException
at
com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
at
org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1204)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:669)
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:351)
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:331)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:193)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:163)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:92)
at
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
at
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
```
Pipeline creation failed, since no datanodes were available:
```
2024-01-08 10:08:41,553 [main] WARN pipeline.WritableRatisContainerProvider
(WritableRatisContainerProvider.java:getContainer(106)) - Pipeline creation
failed for repConfig RATIS/THREE Datanodes may be used up. Try to see if any
pipeline is in ALLOCATED state, and then will wait for it to be OPEN
org.apache.hadoop.hdds.scm.exceptions.SCMException: Ratis pipeline number
meets the limit: 3 replicationConfig : RATIS/THREE
at
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:153)
at
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:57)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:89)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:255)
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:241)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:100)
at
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
at
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
```
However, one pipeline was found to be `ALLOCATED`, so the call waited for
it to become `OPEN`:
```
2024-01-08 10:09:41,554 [main] WARN pipeline.WritableRatisContainerProvider
(WritableRatisContainerProvider.java:getContainer(122)) - Waiting for one of
pipelines [PipelineID=57157ebf-cb57-4f69-817d-8bea082c3750] to be OPEN failed.
java.io.IOException: Pipeline 57157ebf-cb57-4f69-817d-8bea082c3750 is not
ready in 60000 ms
at
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
at
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:120)
at
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
at
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
```
The problem is that both the test timeout and the pipeline wait timeout are
60 seconds, so the test may be aborted just before it gets the expected
`IOException`.
This PR increases the test timeout to 2 minutes. At first I tried reducing the
pipeline report time to avoid the unnecessary wait. That fixed the original
issue, but then hit another intermittent timeout while shutting down datanodes
(which is part of the original test, before the `getContainer` call).
https://issues.apache.org/jira/browse/HDDS-10086
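The interaction between the two timeouts can be reproduced outside Ozone with a minimal sketch. This is not the test code: `TimeoutRace`, `waitForPipeline` and `outcome` are hypothetical names, the durations are scaled down to milliseconds, and the pipeline wait is simulated with `Thread.sleep`. It only shows that the caller sees the inner `IOException` when the outer timeout is comfortably longer than the inner wait, and a `TimeoutException` otherwise.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutRace {

  // Hypothetical stand-in for the pipeline wait: blocks for waitMillis,
  // then fails the way the "not ready in 60000 ms" wait does.
  public static void waitForPipeline(long waitMillis) throws IOException {
    try {
      Thread.sleep(waitMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting for pipeline", e);
    }
    throw new IOException("Pipeline is not ready in " + waitMillis + " ms");
  }

  // Runs the simulated wait under an outer (test-level) timeout and reports
  // which exception the caller ends up observing.
  public static String outcome(long testTimeoutMillis, long pipelineWaitMillis)
      throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Void> call = executor.submit(() -> {
        waitForPipeline(pipelineWaitMillis);
        return null;
      });
      call.get(testTimeoutMillis, TimeUnit.MILLISECONDS);
      return "completed";
    } catch (ExecutionException e) {
      // The inner wait finished first and threw.
      return e.getCause().getClass().getSimpleName();
    } catch (TimeoutException e) {
      // The outer timeout fired before the inner wait gave up.
      return "TimeoutException";
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Outer timeout shorter than the inner wait: the test is aborted.
    System.out.println(outcome(100, 400));
    // Outer timeout well above the inner wait: the IOException surfaces.
    System.out.println(outcome(800, 200));
  }
}
```

When both durations are equal, as in the failing run, either outcome is possible depending on scheduling, which matches the intermittent nature of the failure.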
## How was this patch tested?
Passed in 10x20 runs:
https://github.com/adoroszlai/ozone/actions/runs/7447762180
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]