[ https://issues.apache.org/jira/browse/HDDS-11485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Uma Maheswara Rao G reassigned HDDS-11485:
------------------------------------------
Assignee: Rishabh Patel
> Datanode doesn't report volume unhealthy when V3 DB folder doesn't exist
> ------------------------------------------------------------------------
>
> Key: HDDS-11485
> URL: https://issues.apache.org/jira/browse/HDDS-11485
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Duong
> Assignee: Rishabh Patel
> Priority: Major
> Labels: pull-request-available
>
> I ran a simple test on a running cluster: on one datanode, I went to a
> volume directory and deleted the datastore folder ("DS-<storage id>") under
> the parent folder "CID-<cluster id>".
> {code:java}
> <Volume-Root>
> |-CID-7c19eaaf-701a-4758-b221-9b0de17c0547
> |---DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c
> |-----container.db
> |-----db.checkpoints
> |-----db.snapshots
> |-------checkpointState
> |---tmp
> |-----deleted-containers
> |-----disk-check
> |-VERSION
> {code}
> I then restarted the datanode. On startup, the datanode detects that the
> datastore folder is missing:
> {code:java}
> 2024-09-24 10:32:13 2024-09-24 17:32:13,078 [ForkJoinPool.commonPool-worker-19] ERROR ozoneimpl.OzoneContainer: Load db store for HddsVolume /data/hdds/hdds failed
> 2024-09-24 10:32:13 java.io.IOException: Db parent dir /data/hdds/hdds/CID-7c19eaaf-701a-4758-b221-9b0de17c0547/DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c not found for HddsVolume: /data/hdds/hdds
> 2024-09-24 10:32:13 at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:369)
> 2024-09-24 10:32:13 at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:111)
> 2024-09-24 10:32:13 at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:97)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
> 2024-09-24 10:32:13 at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
> 2024-09-24 10:32:13 2024-09-24 17:32:13,079 [main] INFO ozoneimpl.OzoneContainer: Load 1 volumes DbStore cost: 17ms
> {code}
> However, the error is ignored: the volume is still loaded and used for
> container creation and data writes. The broken volume cannot handle
> container creation, which results in an NPE:
> {code:java}
> 2024-09-12 22:19:48,326 WARN [0823d579-50b5-4c90-ab9d-17b012fd5f85-ChunkWriter-6-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Operation: CreateContainer , Trace ID: , Message: java.lang.NullPointerException: Base Directory cannot be null , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.lang.NullPointerException: Base Directory cannot be null
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:234)
> at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:504)
> at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
> at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:195)
> at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
> at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:194)
> at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:505)
> at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$writeStateMachineData$3(ContainerStateMachine.java:559)
> at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException: Base Directory cannot be null
> at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
> at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerLocationUtil.getContainerDBFile(KeyValueContainerLocationUtil.java:127)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getContainerDBFile(KeyValueContainer.java:938)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:197)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:380)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:248)
> at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:231)
> {code}
> Because this container creation happens during ContainerStateMachine#write
> (state machine serialization), the NPE crashed the pipeline's Raft log and
> left the pipeline in an unusable state.
> The unhealthy volume should be detected and excluded during container
> creation or any StateMachine operation.
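
The behavior requested in the description can be illustrated with a small, self-contained sketch. It is not the Ozone implementation and does not represent the eventual fix: `VolumeHealthSketch`, `Volume`, `State`, `loadAll` and `chooseVolume` are hypothetical stand-ins invented for illustration. The sketch assumes only what the report states, namely that a missing DB parent directory raises an IOException at load time, and shows one way to turn that failure into a FAILED volume state that container creation then skips, instead of logging the error and continuing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model; these are NOT the real Ozone classes. */
final class VolumeHealthSketch {

  enum State { HEALTHY, FAILED }

  /** Stand-in for a datanode volume whose DB lives under dbParentDir. */
  static final class Volume {
    private final Path root;
    private final Path dbParentDir;
    private State state = State.HEALTHY;

    Volume(Path root, Path dbParentDir) {
      this.root = root;
      this.dbParentDir = dbParentDir;
    }

    /** Throws when the DB parent directory is missing, as in the reported log. */
    void loadDbStore() throws IOException {
      if (!Files.isDirectory(dbParentDir)) {
        throw new IOException(
            "Db parent dir " + dbParentDir + " not found for volume: " + root);
      }
    }

    State state() { return state; }
    Path root() { return root; }
    void markFailed() { state = State.FAILED; }
  }

  /** Loads every volume; a load failure marks the volume FAILED, not just logged. */
  static List<Volume> loadAll(List<Volume> volumes) {
    List<Volume> healthy = new ArrayList<>();
    for (Volume v : volumes) {
      try {
        v.loadDbStore();
        healthy.add(v);
      } catch (IOException e) {
        v.markFailed();
        System.err.println("Marking volume failed: " + e.getMessage());
      }
    }
    return healthy;
  }

  /** Picks the first healthy volume; fails with a checked exception if none is usable. */
  static Volume chooseVolume(List<Volume> volumes) throws IOException {
    for (Volume v : volumes) {
      if (v.state() == State.HEALTHY) {
        return v;
      }
    }
    throw new IOException("No healthy volume available for container creation");
  }

  public static void main(String[] args) throws IOException {
    Path base = Files.createTempDirectory("volume-sketch");
    Path goodDb = Files.createDirectories(base.resolve("good").resolve("DS-present"));
    Path badRoot = Files.createDirectories(base.resolve("bad"));
    Volume good = new Volume(goodDb.getParent(), goodDb);
    Volume bad = new Volume(badRoot, badRoot.resolve("DS-missing"));
    List<Volume> all = List.of(bad, good);
    loadAll(all);
    System.out.println("bad=" + bad.state() + ", chosen=" + chooseVolume(all).root());
  }
}
```

In this sketch, a request that finds no healthy volume is rejected with a checked exception that the caller can map to an error result, rather than surfacing later as a NullPointerException inside the state machine write path.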