szetszwo commented on code in PR #9401:
URL: https://github.com/apache/ozone/pull/9401#discussion_r2590190338
##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java:
##########
@@ -805,6 +814,57 @@ void testContainerStateMachineDualFailureRetry()
validateData("ratis1", 2, "ratisratisratisratis");
}
+ @Test
+ void testContainerStateMachineAllNodeFailure()
+ throws Exception {
+ // mark all dn volume as full to induce failure
+ List<Pair<StorageVolume, Long>> increasedVolumeSpace = new ArrayList<>();
+ cluster.getHddsDatanodes().forEach(
+ dn -> {
+ List<StorageVolume> volumesList = dn.getDatanodeStateMachine().getContainer().getVolumeSet().getVolumesList();
+ volumesList.forEach(sv -> {
+ if (sv.getVolumeUsage().isPresent()) {
+ increasedVolumeSpace.add(Pair.of(sv, sv.getCurrentUsage().getAvailable()));
+ sv.getVolumeUsage().get().incrementUsedSpace(sv.getCurrentUsage().getAvailable());
+ }
+ });
+ }
+ );
+
+ long startTime = Time.monotonicNow();
+ ReplicationConfig replicationConfig = ReplicationConfig.fromTypeAndFactor(ReplicationType.RATIS,
+ ReplicationFactor.THREE);
+ try (OzoneOutputStream key = objectStore.getVolume(volumeName).getBucket(bucketName).createKey(
+ "testkey1", 1024, replicationConfig, new HashMap<>())) {
+
+ key.write("ratis".getBytes(UTF_8));
+ key.flush();
+ fail();
+ } catch (IOException ex) {
+ assertTrue(ex.getMessage().contains("Retry request failed. retries get failed due to exceeded" +
+ " maximum allowed retries number: 5"), ex.getMessage());
+ } finally {
+ increasedVolumeSpace.forEach(e -> e.getLeft().getVolumeUsage().ifPresent(
+ p -> p.decrementUsedSpace(e.getRight())));
+ // test execution is less than 2 sec but to be safe putting 30 sec as without fix, taking more than 60 sec
+ assertTrue(Time.monotonicNow() - startTime < 30000, "Operation took longer than expected: "
+ + (Time.monotonicNow() - startTime));
+ }
+
+ // previous pipeline gets closed due to disk full failure, so created a new pipeline and write should succeed,
+ // and this ensures later test case can pass (should not fail due to pipeline unavailability as timeout is 200ms
Review Comment:
Let's move this to a new test file. It is hard to debug when individual tests
affect each other.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]