Re: [PR] HDDS-14040. Ozone client hang for data write in failure scenario [ozone]

via GitHub Thu, 04 Dec 2025 22:31:46 -0800


sumitagrawl commented on code in PR #9401:
URL: https://github.com/apache/ozone/pull/9401#discussion_r2591565492



##########
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestContainerStateMachineFailures.java:
##########
@@ -805,6 +814,57 @@ void testContainerStateMachineDualFailureRetry()
     validateData("ratis1", 2, "ratisratisratisratis");
   }
 
+  @Test
+  void testContainerStateMachineAllNodeFailure()
+      throws Exception {
+    // mark all dn volume as full to induce failure
+    List<Pair<StorageVolume, Long>> increasedVolumeSpace = new ArrayList<>();
+    cluster.getHddsDatanodes().forEach(
+        dn -> {
+          List<StorageVolume> volumesList = dn.getDatanodeStateMachine().getContainer().getVolumeSet().getVolumesList();
+          volumesList.forEach(sv -> {
+            if (sv.getVolumeUsage().isPresent()) {
+              increasedVolumeSpace.add(Pair.of(sv, sv.getCurrentUsage().getAvailable()));
+              sv.getVolumeUsage().get().incrementUsedSpace(sv.getCurrentUsage().getAvailable());
+            }
+          });
+        }
+    );
+
+    long startTime = Time.monotonicNow();
+    ReplicationConfig replicationConfig = ReplicationConfig.fromTypeAndFactor(ReplicationType.RATIS,
+        ReplicationFactor.THREE);
+    try (OzoneOutputStream key = objectStore.getVolume(volumeName).getBucket(bucketName).createKey(
+        "testkey1", 1024, replicationConfig, new HashMap<>())) {
+
+      key.write("ratis".getBytes(UTF_8));
+      key.flush();
+      fail();
+    } catch (IOException ex) {
+      assertTrue(ex.getMessage().contains("Retry request failed. retries get failed due to exceeded" +
+          " maximum allowed retries number: 5"), ex.getMessage());
+    } finally {
+      increasedVolumeSpace.forEach(e -> e.getLeft().getVolumeUsage().ifPresent(
+          p -> p.decrementUsedSpace(e.getRight())));
+      // test execution is less than 2 sec but to be safe putting 30 sec as without fix, taking more than 60 sec
+      assertTrue(Time.monotonicNow() - startTime < 30000, "Operation took longer than expected: "
+          + (Time.monotonicNow() - startTime));
+    }
+
+    // previous pipeline gets closed due to disk full failure, so created a new pipeline and write should succeed,
+    // and this ensures later test case can pass (should not fail due to pipeline unavailability as timeout is 200ms

Review Comment:
   Starting a new cluster would slow the tests down, and all the related failure
   handling is already done in this same test case. IMO it is OK to keep it here.
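
   To illustrate why reusing the cluster is workable: the key point is that the test restores the volume space in `finally`, so later test cases see a healthy state. A minimal, self-contained sketch of that pattern follows. `FakeVolume`, `write`, and `failFastAndRestore` are hypothetical stand-ins made up for illustration, not Ozone classes or APIs; only the shape (fill the volume, expect a fast failure, restore in `finally`) mirrors the test above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;

public class BoundedFailureSketch {

  /** Hypothetical stand-in for a storage volume with adjustable used space. */
  static final class FakeVolume {
    private final long capacity;
    private long used;

    FakeVolume(long capacity) {
      this.capacity = capacity;
    }

    long available() {
      return capacity - used;
    }

    void incrementUsedSpace(long n) {
      used += n;
    }

    void decrementUsedSpace(long n) {
      used -= n;
    }
  }

  /** Hypothetical write: fails when the volume has no room for the data. */
  static void write(FakeVolume v, byte[] data) throws IOException {
    if (v.available() < data.length) {
      throw new IOException("no space left on volume");
    }
    v.incrementUsedSpace(data.length);
  }

  /**
   * Marks the volume as full, expects the write to fail, and restores the
   * space in finally so that shared state is usable by later tests.
   * Returns the elapsed time in milliseconds.
   */
  static long failFastAndRestore(FakeVolume v) {
    long reserved = v.available();
    v.incrementUsedSpace(reserved);            // induce the "disk full" condition
    long start = System.nanoTime();
    boolean failed = false;
    try {
      write(v, new byte[] {1});
    } catch (IOException expected) {
      failed = true;
    } finally {
      v.decrementUsedSpace(reserved);          // always undo the induced failure
    }
    if (!failed) {
      throw new AssertionError("write should have failed on a full volume");
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) throws IOException {
    FakeVolume v = new FakeVolume(1024);
    long elapsedMs = failFastAndRestore(v);
    if (elapsedMs >= 30_000) {
      throw new AssertionError("Operation took longer than expected: " + elapsedMs);
    }
    // After the restore, a follow-up write on the same volume succeeds.
    write(v, "ratis".getBytes(StandardCharsets.UTF_8));
    System.out.println("available after restore and write: " + v.available());
  }
}
```

   In the real test the restore additionally has to account for the pipeline that was closed by the failure, which is what the trailing comment in the hunk is about; the sketch leaves that part out.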



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

