charlesconnell commented on PR #7407:
URL: https://github.com/apache/hbase/pull/7407#issuecomment-3463215508
We're not setting `hbase.wal.provider`, so in theory we ought to be using
`AsyncFSWAL`, but the stacktrace below suggests we're using `FSHLog`, which I
can't explain.
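For reference, a minimal sketch of how we could log or pin the provider explicitly rather than relying on the default. It assumes the standard `HBaseConfiguration` API and the usual provider values `asyncfs` (AsyncFSWAL) and `filesystem` (FSHLog); the class name is made up for illustration:
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalProviderCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml / hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();
    // null means the key is unset and the version-dependent default applies.
    System.out.println("hbase.wal.provider = " + conf.get("hbase.wal.provider"));
    // To force a specific provider instead of relying on the default:
    // conf.set("hbase.wal.provider", "asyncfs");    // AsyncFSWAL
    // conf.set("hbase.wal.provider", "filesystem"); // FSHLog
  }
}
```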
The problem we see in HMasters is this:
```
2025-10-29T18:36:31,234 [RpcServer.priority.RWQ.Fifo.read.handler=18,queue=1,port=60000] WARN org.apache.hadoop.hbase.master.MasterRpcServices: na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630 reported a fatal error:
***** ABORTING region server na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630: Failed log close in log roller *****
Cause:
org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://joke-hb2-a-qa:8020/hbase/WALs/na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630/na1-few-cyan-clam.iad03.hubinternal.net%2C60020%2C1761675985630.1761761228611, unflushedEntries=7
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:428)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:69)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$replaceWriter$6(AbstractFSWAL.java:867)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:866)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriterInternal(AbstractFSWAL.java:922)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$rollWriter$8(AbstractFSWAL.java:952)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:952)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:305)
    at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=215213199, requesting roll of WAL
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:885)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:901)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:365)
    ... 10 more
Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=215213199, requesting roll of WAL
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1183)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1056)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:959)
    at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
    at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
    at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.22.241.95:50010,DS-4007a5dc-1679-46c1-8e0f-31178d1c5c65,DISK], DatanodeInfoWithStorage[172.22.66.247:50010,DS-e060a81f-3473-4625-b81a-776dc93622c3,DISK]], original=[DatanodeInfoWithStorage[172.22.241.95:50010,DS-4007a5dc-1679-46c1-8e0f-31178d1c5c65,DISK], DatanodeInfoWithStorage[172.22.66.247:50010,DS-e060a81f-3473-4625-b81a-776dc93622c3,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
    at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1358)
    at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1426)
    at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1652)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1553)
    at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1535)
    at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1311)
    at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:671)
```
Since several DataNodes have recently restarted, they get added to the
DFSClient's bad-node list, and once enough DataNodes are on that list,
there are none left to write to.
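For context, here is a hedged sketch of the client-side HDFS knobs named in that IOException. The property names are standard HDFS client keys; the values shown are only illustrative and not something this PR proposes to change, since relaxing the replacement policy trades durability for availability:
```java
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryConf {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Policy applied when a datanode in the write pipeline fails: NEVER, DEFAULT, or ALWAYS.
    conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
    // Best-effort mode keeps writing with the remaining datanodes when no replacement
    // can be found, instead of failing the stream (and, here, the WAL roll).
    conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
    System.out.println(conf.get("dfs.client.block.write.replace-datanode-on-failure.policy"));
  }
}
```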