[
https://issues.apache.org/jira/browse/HDDS-13896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036431#comment-18036431
]
Ethan Rose commented on HDDS-13896:
-----------------------------------
Manually injecting this failure into {{OzoneContainer#start}} reproduces the
original issue, where the datanode spins forever without logging any relevant
error message. {{TestSecureOzoneRpcClient#testPutKeySuccessWithBlockToken}} is
used only as a driver for this test case:
{code}
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
index f0e3f4df8a0..a65fe3f7c9b 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
@@ -449,13 +449,22 @@ public void start(String clusterId) throws IOException {
datanodeDetails.setPort(Name.REPLICATION, replicationServer.getPort());
writeChannel.start();
- readChannel.start();
- hddsDispatcher.init();
- hddsDispatcher.setClusterId(clusterId);
- blockDeletingService.start();
- recoveringContainerScrubbingService.start();
-
- initializingStatus.set(InitializingStatus.INITIALIZED);
+ try {
+ LOG.info("---start sleep");
+ Thread.sleep(60_000);
+ LOG.info("---end sleep");
+ } catch (Exception ex) {
+ LOG.error("---sleep failed", ex);
+ }
+ throw new RuntimeException("---test");
+
+// readChannel.start();
+// hddsDispatcher.init();
+// hddsDispatcher.setClusterId(clusterId);
+// blockDeletingService.start();
+// recoveringContainerScrubbingService.start();
+//
+// initializingStatus.set(InitializingStatus.INITIALIZED);
} finally {
// If our status remained uninitialized, then the try block did not complete.
// Mark initialization as failed so other threads can exit.
diff --git a/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java b/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
index 353023f1860..20793cc88bd 100644
--- a/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
+++ b/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/client/rpc/TestSecureOzoneRpcClient.java
@@ -125,7 +125,7 @@ public static void init() throws Exception {
conf.set(OMConfigKeys.OZONE_DEFAULT_BUCKET_LAYOUT,
OMConfigKeys.OZONE_BUCKET_LAYOUT_OBJECT_STORE);
cluster = MiniOzoneCluster.newBuilder(conf)
- .setNumDatanodes(14)
+ .setNumDatanodes(1)
.setScmId(SCM_ID)
.setClusterId(CLUSTER_ID)
.setCertificateClient(certificateClientTest)
{code}
> Slow failure of metadata volume can cause datanode startup to hang
> indefinitely without logging
> -----------------------------------------------------------------------------------------------
>
> Key: HDDS-13896
> URL: https://issues.apache.org/jira/browse/HDDS-13896
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
>
> A {{RunningDatanodeState}} instance does not use the same {{ExecutorService}}
> and {{CompletionService}} across its lifetime. This causes a bug where a
> {{RuntimeException}} thrown out of {{VersionEndpointTask}} could be dropped
> without logging if the heartbeat timeout had elapsed and a new
> {{RunningDatanodeState}} + {{CompletionService}} was being polled instead of
> the previous instance that threw the exception. One example we observed:
> * After a disk hang, Ratis is unable to read logs from the metadata directory
> while starting the Ratis server.
> * Ratis throws unchecked {{IllegalStateException}} or similar when this happens.
> * This exception, which took longer than the heartbeat timeout to show up due
> to the disk stall, exits {{OzoneContainer#start}} but is not logged.
> ** Due to the locking mechanism in {{OzoneContainer#start}}, no retries can
> make progress in the method.
> ** Jstacks will show all SCM heartbeat threads in the datanode blocked at the
> top of {{OzoneContainer#start}}, but the system will not log any errors.
> We should treat all exceptions thrown from {{OzoneContainer#start}} as
> fatal, since the operations performed there, such as starting servers, are
> not idempotent.
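
The dropped-exception mechanism described above can be illustrated with plain {{java.util.concurrent}} classes. This is only a minimal sketch, not the Ozone code: {{StateInstance}} and {{pollForFailure}} are hypothetical stand-ins for a {{RunningDatanodeState}} and its polling logic, and the timeouts are arbitrary. It shows that a failure which completes after the first poll times out is invisible to a freshly created {{CompletionService}} and only surfaces if the original one is polled again.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class DroppedExceptionSketch {

  /** Hypothetical stand-in: each instance owns its own executor and completion queue. */
  static final class StateInstance {
    final ExecutorService executor = Executors.newSingleThreadExecutor();
    final CompletionService<String> completions =
        new ExecutorCompletionService<>(executor);
  }

  /**
   * Waits up to timeoutMs for a completed task on this instance.
   * Returns the failure message if a task failed, otherwise null
   * (nothing completed in time, or the task succeeded).
   */
  static String pollForFailure(StateInstance instance, long timeoutMs)
      throws InterruptedException {
    Future<String> done = instance.completions.poll(timeoutMs, TimeUnit.MILLISECONDS);
    if (done == null) {
      return null;
    }
    try {
      done.get();
      return null;
    } catch (ExecutionException e) {
      return e.getCause().getMessage();
    }
  }

  /** Returns what each of the three polls observed, in order. */
  static String[] demo() throws Exception {
    StateInstance first = new StateInstance();
    StateInstance second = new StateInstance();
    try {
      // A task that fails only after the (assumed) 100 ms timeout has passed.
      Callable<String> slowFailure = () -> {
        Thread.sleep(300);
        throw new IllegalStateException("slow failure");
      };
      first.completions.submit(slowFailure);

      // Poll 1: the timeout elapses before the task fails, so nothing is seen.
      String seenBeforeFailure = pollForFailure(first, 100);
      // Poll 2: a new instance has its own empty queue; the failure never appears here.
      String seenByNewInstance = pollForFailure(second, 500);
      // Poll 3: only the original queue still holds the failed task.
      String seenByOriginal = pollForFailure(first, 500);

      return new String[] {seenBeforeFailure, seenByNewInstance, seenByOriginal};
    } finally {
      first.executor.shutdownNow();
      second.executor.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    String[] seen = demo();
    System.out.println("before failure: " + seen[0]);
    System.out.println("new instance:   " + seen[1]);
    System.out.println("original:       " + seen[2]);
  }
}
```

In this sketch the exception is only reported because the third poll goes back to the original instance. If the caller keeps polling new instances instead, as in the scenario described in this issue, the failure is never observed or logged.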