[jira] [Commented] (HDDS-13896) Slow failure of metadata volume can cause datanode startup to hang indefinitely without logging

Ethan Rose (Jira) Fri, 07 Nov 2025 16:19:04 -0800


    [ https://issues.apache.org/jira/browse/HDDS-13896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036431#comment-18036431 ]


Ethan Rose commented on HDDS-13896:
-----------------------------------

Manually injecting this failure into {{OzoneContainer#start}} reproduces the 
original issue: the datanode spins forever without logging any relevant error 
message. {{TestSecureOzoneRpcClient#testPutKeySuccessWithBlockToken}} is used 
only as a driver for this test case:

{code}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
index f0e3f4df8a0..a65fe3f7c9b 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
@@ -449,13 +449,22 @@ public void start(String clusterId) throws IOException {
       datanodeDetails.setPort(Name.REPLICATION, replicationServer.getPort());
 
       writeChannel.start();
-      readChannel.start();
-      hddsDispatcher.init();
-      hddsDispatcher.setClusterId(clusterId);
-      blockDeletingService.start();
-      recoveringContainerScrubbingService.start();
-
-      initializingStatus.set(InitializingStatus.INITIALIZED);
+      try {
+        LOG.info("---start sleep");
+        Thread.sleep(60_000);
+        LOG.info("---end sleep");
+      } catch (Exception ex) {
+        LOG.error("---sleep failed", ex);
+      }
+      throw new RuntimeException("---test");
+
+//      readChannel.start();
+//      hddsDispatcher.init();
+//      hddsDispatcher.setClusterId(clusterId);
+//      blockDeletingService.start();
+//      recoveringContainerScrubbingService.start();
+//
+//      initializingStatus.set(InitializingStatus.INITIALIZED);
     } finally {
       // If our status remained uninitialized, then the try block did not complete.
       // Mark initialization as failed so other threads can exit.
diff --git a/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java b/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
index 353023f1860..20793cc88bd 100644
--- a/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
+++ b/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
@@ -125,7 +125,7 @@ public static void init() throws Exception {
     conf.set(OMConfigKeys.OZONE_DEFAULT_BUCKET_LAYOUT,
         OMConfigKeys.OZONE_BUCKET_LAYOUT_OBJECT_STORE);
     cluster = MiniOzoneCluster.newBuilder(conf)
-        .setNumDatanodes(14)
+        .setNumDatanodes(1)
         .setScmId(SCM_ID)
         .setClusterId(CLUSTER_ID)
         .setCertificateClient(certificateClientTest)
{code}

> Slow failure of metadata volume can cause datanode startup to hang 
> indefinitely without logging
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDDS-13896
>                 URL: https://issues.apache.org/jira/browse/HDDS-13896
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Ozone Datanode
>            Reporter: Ethan Rose
>            Assignee: Ethan Rose
>            Priority: Major
>
> A {{RunningDatanodeState}} instance does not use the same {{ExecutorService}} 
> and {{CompletionService}} across its lifetime. This causes a bug where a 
> {{RuntimeException}} thrown out of {{VersionEndpointTask}} could be dropped 
> without logging if the heartbeat timeout had elapsed and a new 
> {{RunningDatanodeState}} + {{CompletionService}} was being polled instead of 
> the previous instance that threw the exception. One example we observed:
> * After a disk hang, Ratis is unable to read logs from the metadata directory 
> while starting the Ratis server.
> * Ratis throws unchecked {{IllegalStateException}} or similar when this happens.
> * This exception, which took longer than the heartbeat timeout to show up due 
> to the disk stall, exits {{OzoneContainer#start}} but is not logged.
> ** Due to the locking mechanism in {{OzoneContainer#start}}, no retries can 
> make progress in the method.
> ** Jstacks will show all SCM heartbeat threads in the datanode blocked at the 
> top of {{OzoneContainer#start}}, but the system will not log any errors.
> We should treat all exceptions thrown from {{OzoneContainer#start}} as 
> fatal, since the operations performed there, such as starting servers, are 
> not idempotent.
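
The dropped-exception mechanism described above can be illustrated in isolation. The following is only a minimal sketch using plain {{java.util.concurrent}}, not the actual Ozone code: the class {{DroppedExceptionSketch}} and its methods are hypothetical names, and a {{CountDownLatch}} stands in for the disk stall that delays the failure past the poll timeout. It assumes the poller switches to a fresh {{CompletionService}} after a timeout, so the failure lands in a queue nobody reads.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; none of these names exist in Ozone.
class DroppedExceptionSketch {

  /** Slow task: blocks until released (the "disk stall"), then fails. */
  static Callable<String> slowFailingTask(CountDownLatch release) {
    return () -> {
      release.await();
      throw new IllegalStateException("simulated startup failure");
    };
  }

  /** Returns the failure the poller observed, or null if it saw none. */
  static Throwable failureSeenWhenServiceIsReplaced() throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    try {
      CountDownLatch release = new CountDownLatch(1);
      CompletionService<String> first = new ExecutorCompletionService<>(executor);
      first.submit(slowFailingTask(release));

      // The poll times out because the task is still stalled.
      if (first.poll(50, TimeUnit.MILLISECONDS) != null) {
        throw new AssertionError("task cannot finish before it is released");
      }

      // A fresh service replaces the first one, which is never polled again.
      CompletionService<String> second = new ExecutorCompletionService<>(executor);
      second.submit(() -> "ok");
      release.countDown(); // the slow task now fails into the abandoned queue
      return failureOf(second.take());
    } finally {
      executor.shutdown();
      executor.awaitTermination(5, TimeUnit.SECONDS);
    }
  }

  /** Keeping the original Future and calling get() surfaces the failure. */
  static Throwable failureSeenWhenFutureIsKept() throws InterruptedException {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      CountDownLatch release = new CountDownLatch(1);
      Future<String> kept = executor.submit(slowFailingTask(release));
      release.countDown();
      return failureOf(kept);
    } finally {
      executor.shutdown();
    }
  }

  /** Unwraps the cause of a failed Future, or returns null on success. */
  static Throwable failureOf(Future<String> future) throws InterruptedException {
    try {
      future.get();
      return null;
    } catch (ExecutionException e) {
      return e.getCause();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("replaced service saw: " + failureSeenWhenServiceIsReplaced());
    System.out.println("kept future saw: " + failureSeenWhenFutureIsKept());
  }
}
```

In this sketch the first method observes no failure, because the exception is queued in the abandoned service, while the second method gets the {{IllegalStateException}} back from the retained {{Future}}. Whether the real fix should retain futures, reuse one {{CompletionService}}, or simply treat the exception as fatal is left to the issue discussion.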



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

