sumitagrawl commented on code in PR #9401:
URL: https://github.com/apache/ozone/pull/9401#discussion_r2591565492
##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java:
##########
@@ -805,6 +814,57 @@ void testContainerStateMachineDualFailureRetry()
validateData("ratis1", 2, "ratisratisratisratis");
}
+ @Test
+ void testContainerStateMachineAllNodeFailure()
+ throws Exception {
+ // mark all dn volume as full to induce failure
+ List<Pair<StorageVolume, Long>> increasedVolumeSpace = new ArrayList<>();
+ cluster.getHddsDatanodes().forEach(
+ dn -> {
+ List<StorageVolume> volumesList = dn.getDatanodeStateMachine().getContainer().getVolumeSet().getVolumesList();
+ volumesList.forEach(sv -> {
+ if (sv.getVolumeUsage().isPresent()) {
+ increasedVolumeSpace.add(Pair.of(sv, sv.getCurrentUsage().getAvailable()));
+ sv.getVolumeUsage().get().incrementUsedSpace(sv.getCurrentUsage().getAvailable());
+ }
+ });
+ }
+ );
+
+ long startTime = Time.monotonicNow();
+ ReplicationConfig replicationConfig = ReplicationConfig.fromTypeAndFactor(ReplicationType.RATIS,
+ ReplicationFactor.THREE);
+ try (OzoneOutputStream key = objectStore.getVolume(volumeName).getBucket(bucketName).createKey(
+ "testkey1", 1024, replicationConfig, new HashMap<>())) {
+
+ key.write("ratis".getBytes(UTF_8));
+ key.flush();
+ fail();
+ } catch (IOException ex) {
+ assertTrue(ex.getMessage().contains("Retry request failed. retries get failed due to exceeded" +
+ " maximum allowed retries number: 5"), ex.getMessage());
+ } finally {
+ increasedVolumeSpace.forEach(e -> e.getLeft().getVolumeUsage().ifPresent(
+ p -> p.decrementUsedSpace(e.getRight())));
+ // test execution is less than 2 sec but to be safe putting 30 sec as without fix, taking more than 60 sec
+ assertTrue(Time.monotonicNow() - startTime < 30000, "Operation took longer than expected: "
+ + (Time.monotonicNow() - startTime));
+ }
+
+ // previous pipeline gets closed due to disk full failure, so created a new pipeline and write should succeed,
+ // and this ensures later test case can pass (should not fail due to pipeline unavailability as timeout is 200ms
Review Comment:
Starting a new cluster reduces test performance, and all the related failure
handling is done in the same test case. IMO it is OK to keep it here.
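
Sharing the cluster works only if the test undoes its own fault injection, which the hunk above does by decrementing the used space in the `finally` block. A minimal sketch of that save-and-restore pattern follows. `FakeVolume`, `Reservation` and `runWithFullVolumes` are hypothetical stand-ins invented for illustration, not Ozone classes, and the real test operates on `StorageVolume` usage instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SharedClusterRestoreSketch {

  /** Hypothetical stand-in for a datanode volume's usage tracking; not an Ozone class. */
  static final class FakeVolume {
    private final long capacity;
    private long used;

    FakeVolume(long capacity, long used) {
      this.capacity = capacity;
      this.used = used;
    }

    long getAvailable() {
      return capacity - used;
    }

    void incrementUsedSpace(long delta) {
      used += delta;
    }

    void decrementUsedSpace(long delta) {
      used -= delta;
    }
  }

  /** Remembers how many bytes were artificially marked as used on one volume. */
  static final class Reservation {
    final FakeVolume volume;
    final long bytes;

    Reservation(FakeVolume volume, long bytes) {
      this.volume = volume;
      this.bytes = bytes;
    }
  }

  /**
   * Marks every volume as full, runs the action, and restores the original
   * usage in a finally block so later tests on the same (shared) cluster
   * see the volumes as they were, even if the action throws.
   */
  static void runWithFullVolumes(List<FakeVolume> volumes, Runnable action) {
    List<Reservation> reservations = new ArrayList<>();
    try {
      for (FakeVolume v : volumes) {
        long available = v.getAvailable();
        v.incrementUsedSpace(available);
        reservations.add(new Reservation(v, available));
      }
      action.run();
    } finally {
      for (Reservation r : reservations) {
        r.volume.decrementUsedSpace(r.bytes);
      }
    }
  }

  public static void main(String[] args) {
    List<FakeVolume> volumes = Arrays.asList(new FakeVolume(100, 40), new FakeVolume(200, 0));
    runWithFullVolumes(volumes, () ->
        System.out.println("available during action: " + volumes.get(0).getAvailable()));
    System.out.println("available after restore: " + volumes.get(0).getAvailable());
  }
}
```

Because the restore runs in `finally`, a failing write inside the action does not leave the volumes marked full for the test cases that follow.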
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]