Apache9 commented on PR #7407:
URL: https://github.com/apache/hbase/pull/7407#issuecomment-3473677733
> We're not setting `hbase.wal.provider`, so in theory we ought to be using `AsyncFSWAL`, but the stacktrace below suggests we're using `FSHLog`, which I can't explain.
>
> The problem we see in HMasters is this:
>
> ```
> 2025-10-29T18:36:31,234 [RpcServer.priority.RWQ.Fifo.read.handler=18,queue=1,port=60000] WARN org.apache.hadoop.hbase.master.MasterRpcServices: na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630 reported a fatal error:
> ***** ABORTING region server na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630: Failed log close in log roller *****
> Cause:
> org.apache.hadoop.hbase.regionserver.wal.FailedLogCloseException: hdfs://joke-hb2-a-qa:8020/hbase/WALs/na1-few-cyan-clam.iad03.hubinternal.net,60020,1761675985630/na1-few-cyan-clam.iad03.hubinternal.net%2C60020%2C1761675985630.1761761228611, unflushedEntries=7
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:428)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:69)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$replaceWriter$6(AbstractFSWAL.java:867)
>   at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.replaceWriter(AbstractFSWAL.java:866)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriterInternal(AbstractFSWAL.java:922)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$rollWriter$8(AbstractFSWAL.java:952)
>   at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>   at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:952)
>   at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:305)
>   at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:211)
> Caused by: org.apache.hadoop.hbase.regionserver.wal.FailedSyncBeforeLogCloseException: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=215213199, requesting roll of WAL
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.checkIfSyncFailed(FSHLog.java:885)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$SafePointZigZagLatch.waitSafePoint(FSHLog.java:901)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doReplaceWriter(FSHLog.java:365)
>   ... 10 more
> Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Append sequenceId=215213199, requesting roll of WAL
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:1183)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1056)
>   at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:959)
>   at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168)
>   at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
>   at java.base/java.lang.Thread.run(Thread.java:1583)
> Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[172.22.241.95:50010,DS-4007a5dc-1679-46c1-8e0f-31178d1c5c65,DISK], DatanodeInfoWithStorage[172.22.66.247:50010,DS-e060a81f-3473-4625-b81a-776dc93622c3,DISK]], original=[DatanodeInfoWithStorage[172.22.241.95:50010,DS-4007a5dc-1679-46c1-8e0f-31178d1c5c65,DISK], DatanodeInfoWithStorage[172.22.66.247:50010,DS-e060a81f-3473-4625-b81a-776dc93622c3,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
>   at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1358)
>   at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1426)
>   at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1652)
>   at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1553)
>   at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1535)
>   at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1311)
>   at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:671)
> ```
>
> Since several DataNodes have recently restarted, they get added to the DFSClient's bad node list, and once enough DataNodes are on the bad nodes list, there are none left to write to. This makes the HMaster restart.

Seems we have already started a log roll here? Then I do not think manually issuing another log roll can fix the problem.
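For reference, the WAL implementation in question is selected by `hbase.wal.provider`. A minimal sketch of checking and pinning it explicitly, assuming an HBase 2.x classpath where `asyncfs` maps to `AsyncFSWAL` and `filesystem` maps to `FSHLog` (the class name `WalProviderCheck` is just illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class WalProviderCheck {
  public static void main(String[] args) {
    // Loads hbase-default.xml / hbase-site.xml from the classpath.
    Configuration conf = HBaseConfiguration.create();

    // When the key is unset, HBase resolves a default provider internally;
    // printing the effective value shows whether the site config overrides it.
    System.out.println("hbase.wal.provider = " + conf.get("hbase.wal.provider"));

    // Pinning the provider explicitly removes the ambiguity described above:
    // "asyncfs" selects AsyncFSWAL, "filesystem" selects FSHLog.
    conf.set("hbase.wal.provider", "asyncfs");
  }
}
```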

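On the HDFS side, the policy named in the IOException is a DFSClient-side setting. This is not a fix for the roll-already-in-progress issue raised above, just a sketch of how the relevant client keys could be inspected or tuned; `dfs.client.block.write.replace-datanode-on-failure.best-effort` is a stock HDFS client option, not something introduced by this PR:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ReplaceDatanodePolicyCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // The policy quoted in the stacktrace: DEFAULT tries to recruit a
    // replacement datanode into the write pipeline after a failure, and the
    // exception above is what happens when no further good datanode is found.
    System.out.println(conf.get(
        "dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT"));

    // best-effort lets the write continue on the surviving datanodes instead
    // of failing when no replacement is available, trading durability for
    // pipeline availability.
    conf.setBoolean(
        "dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
  }
}
```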