adoroszlai opened a new pull request, #4699:
URL: https://github.com/apache/ozone/pull/4699

   ## What changes were proposed in this pull request?
   
   `TestDecommissionAndMaintenance` uses `MiniOzoneClusterProvider` to
provision clusters in the background.  Tests intermittently fail due to port
conflicts.
   
   ```
   Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 348.139 s <<< FAILURE! - in org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance
   org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance.testNodeWithOpenPipelineCanBeDecommissionedAndRecommissioned  Time elapsed: 159.55 s  <<< ERROR!
   java.util.concurrent.TimeoutException:
   ...
     at org.apache.hadoop.ozone.MiniOzoneClusterImpl.waitForClusterToBeReady(MiniOzoneClusterImpl.java:218)
     at org.apache.hadoop.ozone.MiniOzoneClusterImpl.restartHddsDatanode(MiniOzoneClusterImpl.java:431)
     at org.apache.hadoop.ozone.scm.node.TestDecommissionAndMaintenance.testNodeWithOpenPipelineCanBeDecommissionedAndRecommissioned(TestDecommissionAndMaintenance.java:234)
   ```
   
   The problem is that, while the datanode is stopped, its ports may be reused
by some component in a new cluster being provisioned in the background.  The
original owner of the port then fails to start, so the cluster never becomes
ready again.
   
   ```
   2023-05-10 07:26:13,629 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] INFO  server.GrpcService (GrpcService.java:startImpl(302)) - 3193002e-fc2b-4cc9-9970-da2531c45e46: GrpcService started, listening on 44925
   ...
   2023-05-10 07:26:37,941 [main] INFO  server.GrpcService (GrpcService.java:closeImpl(320)) - 3193002e-fc2b-4cc9-9970-da2531c45e46: shutdown server GrpcServerProtocolService successfully
   ...
   2023-05-10 07:26:45,485 [EndpointStateMachine task thread for /0.0.0.0:34213 - 0 ] INFO  server.GrpcService (GrpcService.java:startImpl(302)) - 0c852ae0-3c0b-4f2d-b68a-19e305d37000: GrpcService started, listening on 44925
   ...
   2023-05-10 07:26:46,652 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] INFO  ratis.XceiverServerRatis (XceiverServerRatis.java:start(517)) - Starting XceiverServerRatis 3193002e-fc2b-4cc9-9970-da2531c45e46
   2023-05-10 07:26:46,658 [EndpointStateMachine task thread for /0.0.0.0:45947 - 0 ] ERROR server.GrpcService (ExitUtils.java:terminate(133)) - Terminating with exit status 1: Failed to start Grpc server
   java.io.IOException: Failed to bind to address 0.0.0.0/0.0.0.0:44925
   ...
   Caused by: org.apache.ratis.thirdparty.io.netty.channel.unix.Errors$NativeIoException: bind(..) failed: Address already in use
   ```
   
   This PR replaces random ports with a simple incremental allocation starting 
at 15000.  It applies to all `MiniOzoneCluster`-based tests.
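
For illustration only, the idea of incremental allocation can be sketched roughly as below. This is not the code from the patch: the class name `IncrementalPortAllocator`, the method `nextPort()`, and the bind probe that skips ports already in use are all hypothetical, and the only detail taken from the description above is the starting value of 15000.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical sketch: hands out ports from a monotonically increasing
 * counter, so a port given to one component is never handed to another
 * component in the same JVM, even after the first one has been stopped.
 */
public final class IncrementalPortAllocator {

  // Starting value taken from the PR description; the upper bound is simply
  // the highest valid TCP port.
  private static final int START = 15000;
  private static final int MAX = 65535;

  private static final AtomicInteger NEXT = new AtomicInteger(START);

  private IncrementalPortAllocator() {
  }

  /** Returns the next port from the counter that can currently be bound. */
  public static int nextPort() {
    while (true) {
      int port = NEXT.getAndIncrement();
      if (port > MAX) {
        throw new IllegalStateException("Ran out of ports");
      }
      // Assumption of this sketch: skip ports held by unrelated processes.
      if (isBindable(port)) {
        return port;
      }
    }
  }

  /** Tries to bind the port briefly; the probe socket is closed right away. */
  private static boolean isBindable(int port) {
    try (ServerSocket socket = new ServerSocket()) {
      socket.setReuseAddress(true);
      socket.bind(new InetSocketAddress(port));
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}
```

Because the counter only moves forward, a datanode that is stopped and restarted keeps a port number that no later cluster in the same JVM will be assigned. The sketch does not protect against other processes on the host grabbing the port in the meantime; starting below the default Linux ephemeral range (typically 32768 to 60999) makes that less likely but not impossible.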
   
   https://issues.apache.org/jira/browse/HDDS-8581
   
   ## How was this patch tested?
   
   CI:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/4945442087
   
   100x run of `TestDecommissionAndMaintenance`:
   https://github.com/adoroszlai/hadoop-ozone/actions/runs/4944968792

