guluo created HBASE-29376:
-----------------------------

Summary: ReplicationLogCleaner.preClean/getDeletableFiles should return early when asyncClusterConnection closes during HMaster stopping
Key: HBASE-29376
URL: https://issues.apache.org/jira/browse/HBASE-29376
Project: HBase
Issue Type: Improvement
Components: master, Replication
Environment: HBase master
Reporter: guluo

While HMaster is stopping, I found that HBase prints a lot of exception logs like the one below. I had hbase.master.cleaner.interval = 10000 (ms); you can also configure a smaller interval.

2025-06-04T20:49:37,614 ERROR [master/hbase001:16000.Chore.2] master.ReplicationLogCleaner: Error occurred while executing queueStorage.hasData()
org.apache.hadoop.hbase.replication.ReplicationException: failed to get replication queue table
    at org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:538) ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86) ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.preRunCleaner(CleanerChore.java:282) ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:257) ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161) ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107) ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to address=hbase001:16020 failed on local exception: org.apache.hadoop.hbase.ipc.StoppedRpcClientException
    at java.lang.Thread.getStackTrace(Thread.java:1610) ~[?:?]
    at org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:144) ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:163) ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.client.AdminOverAsyncAdmin.tableExists(AdminOverAsyncAdmin.java:130) ~[hbase-client-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:536) ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86) ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
    at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]

The reason:

When the HMaster service enters its stopping phase, the ReplicationLogCleaner chore continues to run periodically. On each run it calls rpm.getQueueStorage().hasData() to check whether the replication queue holds any pending data. However, once HMaster has closed its asyncClusterConnection, the underlying RPC client is already shut down, so the replication queue data can no longer be retrieved and every run fails with the StoppedRpcClientException shown above.

So I think ReplicationLogCleaner should check whether HMaster.asyncClusterConnection is closed and return early if it is, to ensure a graceful shutdown of HMaster.
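
To illustrate the idea, here is a minimal, self-contained sketch of the early-return check. It is not the actual HBase patch: QueueStorage, LogCleaner, the connectionClosed supplier and the ready flag are hypothetical stand-ins that I made up for this example, and a real change would have to consult the master's actual connection state inside ReplicationLogCleaner.

```java
import java.util.function.BooleanSupplier;

public class EarlyReturnCleanerSketch {

  /** Hypothetical stand-in for the replication queue storage (not the HBase interface). */
  interface QueueStorage {
    boolean hasData() throws Exception;
  }

  /** Hypothetical cleaner that only mimics the shape of a preClean() hook. */
  static final class LogCleaner {
    private final BooleanSupplier connectionClosed; // assumption: reports whether the connection is closed
    private final QueueStorage queueStorage;
    private boolean ready;

    LogCleaner(BooleanSupplier connectionClosed, QueueStorage queueStorage) {
      this.connectionClosed = connectionClosed;
      this.queueStorage = queueStorage;
    }

    void preClean() {
      ready = false;
      if (connectionClosed.getAsBoolean()) {
        // The RPC client is already shut down, so any call would fail:
        // skip this round quietly instead of logging an ERROR with a stack trace.
        return;
      }
      try {
        ready = queueStorage.hasData();
      } catch (Exception e) {
        System.err.println("Error occurred while executing queueStorage.hasData(): " + e);
      }
    }

    boolean isReady() {
      return ready;
    }
  }

  public static void main(String[] args) {
    int[] storageCalls = {0};
    QueueStorage storage = () -> {
      storageCalls[0]++;
      return true;
    };

    // Master is stopping: the connection is closed, so hasData() is never invoked.
    LogCleaner stopping = new LogCleaner(() -> true, storage);
    stopping.preClean();

    // Normal operation: the connection is open, so hasData() is invoked once.
    LogCleaner running = new LogCleaner(() -> false, storage);
    running.preClean();

    System.out.println("stopping ready=" + stopping.isReady()
        + ", running ready=" + running.isReady()
        + ", storage calls=" + storageCalls[0]);
  }
}
```

In this sketch the cleaner that sees a closed connection never touches the storage, so nothing is logged for it. The same check would presumably be needed in getDeletableFiles, as the summary suggests.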