Duong created HDDS-11485:
----------------------------

             Summary: Datanode doesn't report corrupted volume unhealthy
                 Key: HDDS-11485
                 URL: https://issues.apache.org/jira/browse/HDDS-11485
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Duong


As a simple test on a running cluster, I went to one datanode volume 
directory and deleted the datastore folder ("DS-<storage id>") under the parent 
folder "CID-<cluster id>":
   <Volume-Root>
   |-CID-7c19eaaf-701a-4758-b221-9b0de17c0547
   |---DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c
   |-----container.db
   |-----db.checkpoints
   |-----db.snapshots
   |-------checkpointState
   |---tmp
   |-----deleted-containers
   |-----disk-check
   |-VERSION 
I then restarted the datanode. On startup, the datanode detects that the 
datastore folder is missing:
{code:java}
2024-09-24 10:32:13 2024-09-24 17:32:13,078 [ForkJoinPool.commonPool-worker-19] ERROR ozoneimpl.OzoneContainer: Load db store for HddsVolume /data/hdds/hdds failed
2024-09-24 10:32:13 java.io.IOException: Db parent dir /data/hdds/hdds/CID-7c19eaaf-701a-4758-b221-9b0de17c0547/DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c not found for HddsVolume: /data/hdds/hdds
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:369)
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:111)
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:97)
2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
2024-09-24 10:32:13 2024-09-24 17:32:13,079 [main] INFO ozoneimpl.OzoneContainer: Load 1 volumes DbStore cost: 17ms
{code}
However, the error is ignored: the volume is still loaded and used for 
container creation and data writes. The problematic volume cannot handle 
container creation, which results in an NPE:
{code:java}
2024-09-12 22:19:48,326 WARN [0823d579-50b5-4c90-ab9d-17b012fd5f85-ChunkWriter-6-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Operation: CreateContainer , Trace ID:  , Message: java.lang.NullPointerException: Base Directory cannot be null , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.lang.NullPointerException: Base Directory cannot be null
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:234)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:504)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:195)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:194)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:505)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$writeStateMachineData$3(ContainerStateMachine.java:559)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException: Base Directory cannot be null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerLocationUtil.getContainerDBFile(KeyValueContainerLocationUtil.java:127)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getContainerDBFile(KeyValueContainer.java:938)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:197)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:380)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:248)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:231)
{code}
Because this container creation happens during ContainerStateMachine#write 
(state machine serialization), the NPE crashed the pipeline's Raft log and left 
the pipeline in an unusable state. 

The unhealthy volume should be detected and excluded during container creation 
or any StateMachine operation. 
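
To illustrate the expected behavior, here is a minimal, self-contained sketch. It is not the actual Ozone code: SketchVolume, SketchVolumeSet, and all of their methods are hypothetical names made up for this example, and only the "CID-<cluster id>/DS-<storage id>" layout is taken from the listing above. The idea is that a volume whose datastore folder is missing gets marked as failed at load time and is left out of the set of volumes offered for container creation, rather than only logging the error.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a datanode volume; NOT the real HddsVolume API. */
final class SketchVolume {
  enum State { HEALTHY, FAILED }

  private final Path root;
  private final String clusterId;
  private final String storageId;
  private State state = State.HEALTHY;

  SketchVolume(Path root, String clusterId, String storageId) {
    this.root = root;
    this.clusterId = clusterId;
    this.storageId = storageId;
  }

  /** Resolves <root>/CID-<cluster id>/DS-<storage id>. */
  Path dbParentDir() {
    return root.resolve("CID-" + clusterId).resolve("DS-" + storageId);
  }

  /** Throws when the datastore folder is missing. */
  void loadDbStore() throws IOException {
    Path dir = dbParentDir();
    if (!Files.isDirectory(dir)) {
      throw new IOException(
          "Db parent dir " + dir + " not found for volume: " + root);
    }
  }

  void markFailed() {
    state = State.FAILED;
  }

  State getState() {
    return state;
  }
}

/** Hypothetical helper that loads volumes and keeps only the usable ones. */
final class SketchVolumeSet {
  private SketchVolumeSet() { }

  /** Returns the volumes whose db store loaded; the others are marked FAILED. */
  static List<SketchVolume> loadAll(List<SketchVolume> volumes) {
    List<SketchVolume> usable = new ArrayList<>();
    for (SketchVolume v : volumes) {
      try {
        v.loadDbStore();
        usable.add(v);
      } catch (IOException e) {
        // The load failure changes the volume state instead of being
        // logged and ignored, so the volume is never chosen afterwards.
        v.markFailed();
      }
    }
    return usable;
  }
}
```

With something along these lines, a container-creation path that picks only from the returned list would never reach a volume with a null base directory, so the failure would surface as an unhealthy volume rather than as an NPE inside the state machine write.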


