adoroszlai opened a new pull request, #5945:
URL: https://github.com/apache/ozone/pull/5945

   ## What changes were proposed in this pull request?
   
   HDDS-8982 added a new assertion to `TestSafeMode` and set a timeout of 1 
minute for the test case.  A recent run failed with the following error:
   
   ```
   Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 103.8 s <<< 
FAILURE! -- in org.apache.hadoop.fs.ozone.TestSafeMode
   org.apache.hadoop.fs.ozone.TestSafeMode.o3fs -- Time elapsed: 72.90 s <<< 
ERROR!
   java.util.concurrent.TimeoutException: o3fs() timed out after 60 seconds
   ```
   
   The initial `selectContainer` call correctly found no container:
   
   ```
   2024-01-08 10:08:41,553 [main] WARN  container.ContainerManagerImpl 
(ContainerManagerImpl.java:getMatchingContainer(344)) - Container allocation 
failed on pipeline=Pipeline[ Id: 30c296b9-71b9-4744-8977-9b77b35a0eb3, Nodes: 
e6c09ac7-730a-4058-a1a8-e64ffc2fb789(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)d591e5af-e8f5-464c-8a55-75d5e8fc5b83(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19)6df5c950-b1f8-41e9-a8bd-47cd1228c3d6(fv-az1117-812.frmogvfxo1lepnjzqgguis2xib.cx.internal.cloudapp.net/10.1.0.19),
 ReplicationConfig: RATIS/THREE, State:OPEN, 
leaderId:e6c09ac7-730a-4058-a1a8-e64ffc2fb789, 
CreationTimestamp2024-01-08T10:08:39.438Z[Etc/UTC]]
   java.lang.IllegalArgumentException
        at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:129)
        at 
org.apache.hadoop.hdds.scm.node.SCMNodeManager.minHealthyVolumeNum(SCMNodeManager.java:1204)
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.minHealthyVolumeNum(PipelineManagerImpl.java:669)
        at 
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getOpenContainerCountPerPipeline(ContainerManagerImpl.java:351)
        at 
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.getMatchingContainer(ContainerManagerImpl.java:331)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.selectContainer(WritableRatisContainerProvider.java:193)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:163)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:92)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
        at 
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
   ```
   
   Pipeline creation failed, since no datanodes were available:
   
   ```
   2024-01-08 10:08:41,553 [main] WARN  pipeline.WritableRatisContainerProvider 
(WritableRatisContainerProvider.java:getContainer(106)) - Pipeline creation 
failed for repConfig RATIS/THREE Datanodes may be used up. Try to see if any 
pipeline is in ALLOCATED state, and then will wait for it to be OPEN
   org.apache.hadoop.hdds.scm.exceptions.SCMException: Ratis pipeline number 
meets the limit: 3 replicationConfig : RATIS/THREE
        at 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:153)
        at 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.create(RatisPipelineProvider.java:57)
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineFactory.create(PipelineFactory.java:89)
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:255)
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.createPipeline(PipelineManagerImpl.java:241)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:100)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
        at 
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
   ```
   
   However, one pipeline was found to be `ALLOCATED`, so the call waited for 
that to be opened:
   
   ```
   2024-01-08 10:09:41,554 [main] WARN  pipeline.WritableRatisContainerProvider 
(WritableRatisContainerProvider.java:getContainer(122)) - Waiting for one of 
pipelines [PipelineID=57157ebf-cb57-4f69-817d-8bea082c3750] to be OPEN failed. 
   java.io.IOException: Pipeline 57157ebf-cb57-4f69-817d-8bea082c3750 is not 
ready in 60000 ms
        at 
org.apache.hadoop.hdds.scm.pipeline.PipelineManagerImpl.waitOnePipelineReady(PipelineManagerImpl.java:772)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableRatisContainerProvider.getContainer(WritableRatisContainerProvider.java:120)
        at 
org.apache.hadoop.hdds.scm.pipeline.WritableContainerFactory.getContainer(WritableContainerFactory.java:74)
        at 
org.apache.hadoop.fs.ozone.TestSafeMode.lambda$testSafeMode$0(TestSafeMode.java:123)
   ```
   
   The problem is that the test timeout and the pipeline wait are both 60 
seconds, so the test may be aborted just before it gets the expected 
`IOException`.
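
The race can be reproduced outside Ozone with a minimal sketch. Everything in it is an assumption made for illustration: `TimeoutRaceSketch`, `waitForPipeline` and `run` are hypothetical names, not code from `TestSafeMode` or SCM, and the durations are scaled down from seconds to milliseconds. An inner wait that fails with `IOException` after a fixed time runs under an outer timeout; the expected exception is only reliably observed when the outer timeout is clearly longer than the inner wait.

```java
import java.io.IOException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the test setup; not the actual TestSafeMode code. */
final class TimeoutRaceSketch {

  /** Stand-in for the pipeline wait: blocks for waitMs, then fails with IOException. */
  static void waitForPipeline(long waitMs) throws IOException, InterruptedException {
    Thread.sleep(waitMs);
    throw new IOException("Pipeline is not ready in " + waitMs + " ms");
  }

  /** Runs the inner wait under an outer (test-level) timeout and reports which one fired. */
  static String run(long innerWaitMs, long outerTimeoutMs) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Void> call = executor.submit(() -> {
        waitForPipeline(innerWaitMs);
        return null;
      });
      try {
        call.get(outerTimeoutMs, TimeUnit.MILLISECONDS);
        return "completed";
      } catch (TimeoutException e) {
        // The outer timeout fired before the inner wait could fail.
        return "outer timeout";
      } catch (ExecutionException e) {
        return e.getCause() instanceof IOException
            ? "expected IOException" : "unexpected failure";
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return "interrupted";
      }
    } finally {
      // Interrupts the worker if it is still sleeping.
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Outer timeout well above the inner wait: the IOException surfaces.
    System.out.println(run(50, 2000));
    // Outer timeout well below the inner wait: the call is aborted first.
    System.out.println(run(2000, 50));
    // Equal durations: either outcome is possible, which is the flakiness described above.
    System.out.println(run(100, 100));
  }
}
```

With equal durations the result depends on scheduling, so the sketch makes no claim about which branch is taken in that case.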
   
   This PR increases the test timeout to 2 minutes.  At first I tried reducing 
the pipeline report time to avoid the unnecessary wait.  That fixed the original 
issue, but the test then hit another intermittent timeout while shutting down 
datanodes (which is part of the original test, before the `getContainer` call).
   
   https://issues.apache.org/jira/browse/HDDS-10086
   
   ## How was this patch tested?
   
   Passed in 10x20 runs:
   https://github.com/adoroszlai/ozone/actions/runs/7447762180


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

