761417898 opened a new pull request, #15676:
URL: https://github.com/apache/iotdb/pull/15676
## Description
### Key Changes
- **Added disk directory failure detection and recovery**:
  - When a disk directory turns out to be inaccessible on first access, the system now marks the directory as abnormal and automatically attempts to fetch another available directory.
  - This resolves single-disk directory failures in multi-disk environments.
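
The failover behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `FailoverFolderSelector`, `FolderAction`, `markAbnormal`, `stateOf`, and `withFailover` are hypothetical names, and the `getNextFolder` shown here only borrows the name of the real `FolderManager#getNextFolder`, whose actual signature and selection strategy may differ. It assumes a simple round-robin over folders that are still healthy.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of directory failover; NOT IoTDB's FolderManager. */
class FailoverFolderSelector {
  enum FolderState { HEALTHY, ABNORMAL }

  /** An operation that uses a folder and may fail with an I/O error. */
  interface FolderAction<T> {
    T apply(String folder) throws IOException;
  }

  private final List<String> folders;
  private final List<FolderState> states = new ArrayList<>();
  private int next = 0;

  FailoverFolderSelector(List<String> folders) {
    this.folders = new ArrayList<>(folders);
    for (int i = 0; i < folders.size(); i++) {
      states.add(FolderState.HEALTHY);
    }
  }

  /** Round-robin over healthy folders; throws once none is left. */
  synchronized String getNextFolder() throws IOException {
    for (int i = 0; i < folders.size(); i++) {
      int idx = (next + i) % folders.size();
      if (states.get(idx) == FolderState.HEALTHY) {
        next = (idx + 1) % folders.size();
        return folders.get(idx);
      }
    }
    throw new IOException("no healthy folder available");
  }

  /** Records that a folder failed so later selections skip it. */
  synchronized void markAbnormal(String folder) {
    int idx = folders.indexOf(folder);
    if (idx >= 0) {
      states.set(idx, FolderState.ABNORMAL);
    }
  }

  synchronized FolderState stateOf(String folder) {
    int idx = folders.indexOf(folder);
    if (idx < 0) {
      throw new IllegalArgumentException("unknown folder: " + folder);
    }
    return states.get(idx);
  }

  /**
   * Runs the action on the next healthy folder. On an I/O failure the folder
   * is marked ABNORMAL and the action is retried on another folder. The loop
   * ends because every failure removes one folder from rotation, after which
   * getNextFolder() throws.
   */
  <T> T withFailover(FolderAction<T> action) throws IOException {
    while (true) {
      String folder = getNextFolder();
      try {
        return action.apply(folder);
      } catch (IOException e) {
        System.err.println("WARN failed to process folder " + folder
            + ", state set to ABNORMAL: " + e.getMessage());
        markAbnormal(folder);
      }
    }
  }
}
```

With three folders where the first one is unreadable, a call such as `selector.withFailover(folder -> createFileIn(folder))` (`createFileIn` being a placeholder for the caller's own I/O) would fail once, mark that folder abnormal, and complete on the second folder. Later calls skip the abnormal folder entirely.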
### Verification
- **Tested in a 3N3C environment** (refer to the [Feishu Doc](https://timechor.feishu.cn/docx/Cgi1dMLhfovBs9xqK0dc1P0VnVe)).
- **Behavior**:
  - IoTDB now logs directory access failures (see the example logs below) and successfully retries with a healthy directory.
### Example Log Output
```plaintext
(Now changed to warning-level logging)
2025-06-09 10:13:34,500 [pool-33-IoTDB-DataNodeInternalRPC-Processor-24] ERROR o.a.i.d.s.d.w.a.AbstractNodeAllocationStrategy:72 - Meet exception when creating wal node
java.io.FileNotFoundException: /root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/wal3/root.db1.g_0-7/_0.checkpoint (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.iotdb.db.storageengine.dataregion.wal.io.LogWriter.<init>(LogWriter.java:70)
    at org.apache.iotdb.db.storageengine.dataregion.wal.io.CheckpointWriter.<init>(CheckpointWriter.java:30)
    at org.apache.iotdb.db.storageengine.dataregion.wal.checkpoint.CheckpointManager.<init>(CheckpointManager.java:90)
    at org.apache.iotdb.db.storageengine.dataregion.wal.node.WALNode.<init>(WALNode.java:129)
    at org.apache.iotdb.db.storageengine.dataregion.wal.node.WALNode.<init>(WALNode.java:118)
    at org.apache.iotdb.db.storageengine.dataregion.wal.allocation.AbstractNodeAllocationStrategy.createWALNode(AbstractNodeAllocationStrategy.java:70)
    at org.apache.iotdb.db.storageengine.dataregion.wal.allocation.FirstCreateStrategy.applyForWALNode(FirstCreateStrategy.java:56)
    at org.apache.iotdb.db.storageengine.dataregion.wal.WALManager.applyForWALNode(WALManager.java:100)
    at org.apache.iotdb.db.storageengine.dataregion.DataRegion.getWALNode(DataRegion.java:3840)
    at org.apache.iotdb.db.consensus.statemachine.dataregion.DataRegionStateMachine.read(DataRegionStateMachine.java:242)
    at org.apache.iotdb.consensus.iot.IoTConsensusServerImpl.<init>(IoTConsensusServerImpl.java:150)
    at org.apache.iotdb.consensus.iot.IoTConsensus.lambda$createLocalPeer$8(IoTConsensus.java:286)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at org.apache.iotdb.consensus.iot.IoTConsensus.createLocalPeer(IoTConsensus.java:269)
    at org.apache.iotdb.db.protocol.thrift.impl.DataNodeRegionManager.createDataRegion(DataNodeRegionManager.java:157)
    at org.apache.iotdb.db.protocol.thrift.impl.DataNodeInternalRPCServiceImpl.createDataRegion(DataNodeInternalRPCServiceImpl.java:562)
    at org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createDataRegion.getResult(IDataNodeRPCService.java:6511)
    at org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createDataRegion.getResult(IDataNodeRPCService.java:6491)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2025-06-09 10:13:34,735 [pool-33-IoTDB-DataNodeInternalRPC-Processor-12] WARN o.a.i.d.s.d.t.g.TsFileNameGenerator:120 - Failed to process folder [tierLevel=0, sequence=true, baseDir=/root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/data3/sequence], state set to ABNORMAL
2025-06-09 10:13:34,735 [pool-33-IoTDB-DataNodeInternalRPC-Processor-13] WARN o.a.i.d.s.d.t.g.TsFileNameGenerator:120 - Failed to process folder [tierLevel=0, sequence=true, baseDir=/root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/data3/sequence], state set to ABNORMAL
2025-06-09 10:13:34,735 [pool-33-IoTDB-DataNodeInternalRPC-Processor-7] WARN o.a.i.d.s.d.t.g.TsFileNameGenerator:120 - Failed to process folder [tierLevel=0, sequence=true, baseDir=/root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/data3/sequence], state set to ABNORMAL
(Now adjusted to: {} is above the warning threshold, or not accessible, free space {}, total space {})
2025-06-09 10:13:35,215 [pool-24-IoTDB-IoTConsensusRPC-Processor-2] WARN o.a.i.d.s.r.d.s.SequenceStrategy:70 - /root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/data3/sequence is above the warning threshold, free space 118933000192, total space243640324096
2025-06-09 10:13:55,833 [pool-24-IoTDB-IoTConsensusRPC-Processor-7] WARN o.a.i.d.s.d.t.g.TsFileNameGenerator:120 - Failed to process folder [tierLevel=0, sequence=false, baseDir=/root/apache-iotdb-2.0.4-SNAPSHOT-all-bin/data/datanode/data3/unsequence], state set to ABNORMAL
```
---
### Key Changed Classes/Packages
- `FolderManager` (core logic for directory failover)
- Also includes the relevant code that calls `org.apache.iotdb.db.storageengine.rescon.disk.FolderManager#getNextFolder`
---