[ 
https://issues.apache.org/jira/browse/HDDS-11485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G reassigned HDDS-11485:
------------------------------------------

    Assignee: Rishabh Patel

> Datanode doesn't report volume unhealthy when V3 DB folder doesn't exist
> ------------------------------------------------------------------------
>
>                 Key: HDDS-11485
>                 URL: https://issues.apache.org/jira/browse/HDDS-11485
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Duong
>            Assignee: Rishabh Patel
>            Priority: Major
>              Labels: pull-request-available
>
> I ran a simple test on a running cluster: on one datanode, I went to a
> volume directory and deleted the datastore folder ("DS-<storage id>")
> under the parent folder "CID-<cluster id>".
> {code:java}
>  <Volume-Root>
>    |-CID-7c19eaaf-701a-4758-b221-9b0de17c0547
>    |---DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c
>    |-----container.db
>    |-----db.checkpoints
>    |-----db.snapshots
>    |-------checkpointState
>    |---tmp
>    |-----deleted-containers
>    |-----disk-check
>    |-VERSION
> {code}
> I then restarted the datanode. On startup, the datanode detects that the
> datastore folder is missing:
> {code:java}
> 2024-09-24 10:32:13 2024-09-24 17:32:13,078 [ForkJoinPool.commonPool-worker-19] ERROR ozoneimpl.OzoneContainer: Load db store for HddsVolume /data/hdds/hdds failed
> 2024-09-24 10:32:13 java.io.IOException: Db parent dir /data/hdds/hdds/CID-7c19eaaf-701a-4758-b221-9b0de17c0547/DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c not found for HddsVolume: /data/hdds/hdds
> 2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:369)
> 2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:111)
> 2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:97)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
> 2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
> 2024-09-24 10:32:13 2024-09-24 17:32:13,079 [main] INFO ozoneimpl.OzoneContainer: Load 1 volumes DbStore cost: 17ms
> {code}
> However, the error is ignored: the volume is still loaded and used for
> container creation and data writes. The broken volume cannot handle
> container creation, and the attempt results in an NPE:
> {code:java}
> 2024-09-12 22:19:48,326 WARN [0823d579-50b5-4c90-ab9d-17b012fd5f85-ChunkWriter-6-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Operation: CreateContainer , Trace ID:  , Message: java.lang.NullPointerException: Base Directory cannot be null , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.lang.NullPointerException: Base Directory cannot be null
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:234)
>     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:504)
>     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
>     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:195)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
>     at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:194)
>     at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:505)
>     at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$writeStateMachineData$3(ContainerStateMachine.java:559)
>     at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>     at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException: Base Directory cannot be null
>     at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
>     at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerLocationUtil.getContainerDBFile(KeyValueContainerLocationUtil.java:127)
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getContainerDBFile(KeyValueContainer.java:938)
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:197)
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:380)
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:248)
>     at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:231)
> {code}
> Because this container creation happens during ContainerStateMachine#write
> (state machine serialization), the NPE crashed the pipeline's Raft log and
> put the pipeline in an unusable state.
> The unhealthy volume should be detected and excluded during container
> creation or any StateMachine operation.
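
The requested behavior could look roughly like the sketch below. It is only an illustration under stated assumptions: `VolumeHealthSketch`, `Volume`, `markFailed` and `usableVolumes` are hypothetical names, not the real Ozone classes (`HddsVolume`, `HddsVolumeUtil`) and not necessarily how the issue was eventually fixed. The idea is that a failed DB store load marks the volume as failed instead of only being logged, and that callers pick volumes from a filtered list.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class VolumeHealthSketch {

  /** Hypothetical stand-in for a datanode volume; not the real HddsVolume. */
  static final class Volume {
    private final Path root;
    private final Path dbParentDir; // <root>/CID-<cluster id>/DS-<storage id>
    private volatile boolean failed;

    Volume(Path root, String clusterDir, String storageDir) {
      this.root = root;
      this.dbParentDir = root.resolve(clusterDir).resolve(storageDir);
    }

    /** Throws when the datastore folder is missing, as in the first log above. */
    void loadDbStore() throws IOException {
      if (!Files.isDirectory(dbParentDir)) {
        throw new IOException(
            "Db parent dir " + dbParentDir + " not found for volume: " + root);
      }
    }

    void markFailed() {
      failed = true;
    }

    boolean isFailed() {
      return failed;
    }
  }

  /** Loads all DB stores in parallel; a load error marks the volume failed. */
  static void loadAll(List<Volume> volumes) {
    List<CompletableFuture<Void>> futures = new ArrayList<>();
    for (Volume v : volumes) {
      futures.add(CompletableFuture.runAsync(() -> {
        try {
          v.loadDbStore();
        } catch (IOException e) {
          // Log AND record the failure, so the volume is not used later.
          System.err.println("Load db store failed: " + e.getMessage());
          v.markFailed();
        }
      }));
    }
    // Wait for every load attempt before the volumes are handed out.
    CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
  }

  /** Returns only the volumes that may be used for container creation. */
  static List<Volume> usableVolumes(List<Volume> volumes) {
    List<Volume> healthy = new ArrayList<>();
    for (Volume v : volumes) {
      if (!v.isFailed()) {
        healthy.add(v);
      }
    }
    return healthy;
  }

  public static void main(String[] args) throws IOException {
    Path good = Files.createTempDirectory("vol-good");
    Path bad = Files.createTempDirectory("vol-bad");
    Files.createDirectories(good.resolve("CID-test").resolve("DS-test"));
    // "bad" has no CID-test/DS-test folder, mimicking the deleted datastore.
    List<Volume> volumes = Arrays.asList(
        new Volume(good, "CID-test", "DS-test"),
        new Volume(bad, "CID-test", "DS-test"));
    loadAll(volumes);
    System.out.println("usable volumes: " + usableVolumes(volumes).size());
  }
}
```

With this shape, the volume whose datastore folder is missing never reaches the container creation path, so the "Base Directory cannot be null" check would not be hit inside the state machine write. Whether the real datanode should also report the volume as failed to SCM is outside what this sketch covers.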



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
