[jira] [Updated] (HDDS-11485) Datanode doesn't report corrupted volume unhealthy

Duong (Jira) Tue, 24 Sep 2024 11:06:18 -0700


     [ https://issues.apache.org/jira/browse/HDDS-11485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Duong updated HDDS-11485:
-------------------------
    Description: 
I tried a simple test in a running cluster: on one datanode, I went to a 
volume directory and deleted the datastore folder (the "DS-<storage id>" 
folder) under the parent folder "CID-<cluster id>".
{code:java}
 <Volume-Root>
   |-CID-7c19eaaf-701a-4758-b221-9b0de17c0547
   |---DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c
   |-----container.db
   |-----db.checkpoints
   |-----db.snapshots
   |-------checkpointState
   |---tmp
   |-----deleted-containers
   |-----disk-check
   |-VERSION
{code}

I then restarted the datanode. When the datanode comes up, it detects that 
the datastore folder is missing:
{code:java}
2024-09-24 10:32:13 2024-09-24 17:32:13,078 [ForkJoinPool.commonPool-worker-19] ERROR ozoneimpl.OzoneContainer: Load db store for HddsVolume /data/hdds/hdds failed
2024-09-24 10:32:13 java.io.IOException: Db parent dir /data/hdds/hdds/CID-7c19eaaf-701a-4758-b221-9b0de17c0547/DS-0f05073c-0f96-47e4-a335-f4aa67f20e2c not found for HddsVolume: /data/hdds/hdds
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.volume.HddsVolume.loadDbStore(HddsVolume.java:369)
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.loadVolume(HddsVolumeUtil.java:111)
2024-09-24 10:32:13     at org.apache.hadoop.ozone.container.common.utils.HddsVolumeUtil.lambda$loadAllHddsVolumeDbStore$0(HddsVolumeUtil.java:97)
2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
2024-09-24 10:32:13     at java.base/java.util.concurrent.CompletableFuture$AsyncRun.exec(CompletableFuture.java:1728)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
2024-09-24 10:32:13     at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
2024-09-24 10:32:13 2024-09-24 17:32:13,079 [main] INFO ozoneimpl.OzoneContainer: Load 1 volumes DbStore cost: 17ms
{code}
However, the error is ignored, and the volume is still loaded and used for 
container creation and data writes. The problematic volume cannot handle 
container creation, which results in an NPE:
{code:java}
2024-09-12 22:19:48,326 WARN [0823d579-50b5-4c90-ab9d-17b012fd5f85-ChunkWriter-6-0]-org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler: Operation: CreateContainer , Trace ID:  , Message: java.lang.NullPointerException: Base Directory cannot be null , Result: CONTAINER_INTERNAL_ERROR , StorageContainerException Occurred.
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.lang.NullPointerException: Base Directory cannot be null
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:234)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:504)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.lambda$dispatch$0(HddsDispatcher.java:195)
    at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
    at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:194)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:505)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$writeStateMachineData$3(ContainerStateMachine.java:559)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException: Base Directory cannot be null
    at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:921)
    at org.apache.hadoop.ozone.container.keyvalue.helpers.KeyValueContainerLocationUtil.getContainerDBFile(KeyValueContainerLocationUtil.java:127)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getContainerDBFile(KeyValueContainer.java:938)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:197)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:380)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(KeyValueHandler.java:248)
    at org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:231)
{code}
Because this container creation happens during ContainerStateMachine#write 
(state machine serialization), the NPE crashed the pipeline's Raft log and 
left the pipeline in an unusable state. 

The unhealthy volume should be detected and excluded during container creation 
or any StateMachine operation. 
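
To illustrate the expected behavior, here is a minimal sketch. It is not the 
actual Ozone code: the class VolumeHealthSketch, the nested Volume type, and 
the methods loadAll and chooseVolume are all hypothetical names made up for 
this example. It only assumes the folder layout shown above, where the 
datastore folder sits under the cluster folder. A load failure marks the 
volume as failed instead of being logged and ignored, and container creation 
only picks from volumes that are not failed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class VolumeHealthSketch {

  /** Hypothetical stand-in for a datanode volume; not Ozone's HddsVolume. */
  static final class Volume {
    final Path root;
    final String clusterDir;    // folder name such as "CID-<cluster id>"
    final String datastoreDir;  // folder name such as "DS-<storage id>"
    boolean failed;

    Volume(Path root, String clusterDir, String datastoreDir) {
      this.root = root;
      this.clusterDir = clusterDir;
      this.datastoreDir = datastoreDir;
    }

    /** Throws when the datastore folder is missing, like the error in the log. */
    void loadDbStore() throws IOException {
      Path dbParent = root.resolve(clusterDir).resolve(datastoreDir);
      if (!Files.isDirectory(dbParent)) {
        throw new IOException(
            "Db parent dir " + dbParent + " not found for volume: " + root);
      }
    }
  }

  /** Detection: a load failure marks the volume failed instead of being ignored. */
  static void loadAll(List<Volume> volumes) {
    for (Volume v : volumes) {
      try {
        v.loadDbStore();
      } catch (IOException e) {
        v.failed = true;
        System.err.println("Marking volume as failed: " + e.getMessage());
      }
    }
  }

  /** Exclusion: container creation only ever sees volumes that are not failed. */
  static Volume chooseVolume(List<Volume> volumes) throws IOException {
    for (Volume v : volumes) {
      if (!v.failed) {
        return v;
      }
    }
    throw new IOException("No healthy volume available for container creation");
  }
}
```

With this shape, a datanode whose only volume lost its datastore folder would 
reject container creation with a regular IOException rather than hitting the 
NPE in the write path.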



> Datanode doesn't report corrupted volume unhealthy
> --------------------------------------------------
>
>                 Key: HDDS-11485
>                 URL: https://issues.apache.org/jira/browse/HDDS-11485
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Duong
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
