guluo created HBASE-29376:
-----------------------------

             Summary: ReplicationLogCleaner.preClean/getDeletableFiles should 
return early when asyncClusterConnection closes during HMaster stopping
                 Key: HBASE-29376
                 URL: https://issues.apache.org/jira/browse/HBASE-29376
             Project: HBase
          Issue Type: Improvement
          Components: master, Replication
         Environment: HBase master
            Reporter: guluo


While HMaster is stopping, HBase logs a large number of exceptions like the one 
below. I observed this with hbase.master.cleaner.interval = 10000 (ms); a 
smaller interval can also be configured to reproduce it.

2025-06-04T20:49:37,614 ERROR [master/hbase001:16000.Chore.2] 
master.ReplicationLogCleaner: Error occurred while executing 
queueStorage.hasData()
org.apache.hadoop.hbase.replication.ReplicationException: failed to get 
replication queue table
        at 
org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:538)
 ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
 ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.preRunCleaner(CleanerChore.java:282)
 ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:257)
 ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161) 
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
~[?:?]
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
 ~[?:?]
        at 
org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
 ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) 
~[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) 
~[?:?]
        at java.lang.Thread.run(Thread.java:833) ~[?:?]
Caused by: org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to 
address=hbase001:16020 failed on local exception: 
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
        at java.lang.Thread.getStackTrace(Thread.java:1610) ~[?:?]
        at 
org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:144) 
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:163) 
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) 
~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.client.AdminOverAsyncAdmin.tableExists(AdminOverAsyncAdmin.java:130)
 ~[hbase-client-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:536)
 ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
 ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]

 

Root cause:
When the HMaster service enters its stopping phase, the ReplicationLogCleaner 
task continues to execute periodically. During these executions, it invokes the 
rpm.getQueueStorage().hasData() method to check for the existence of pending 
data in the replication queue.
However, once the HMaster service closes its asyncClusterConnection, we can no 
longer properly retrieve replication queue data because the underlying RPC 
client has been shut down at that point.

So I think ReplicationLogCleaner should check whether 
HMaster.asyncClusterConnection is closed and, if it is, return early from 
preClean/getDeletableFiles, so that HMaster shuts down gracefully.
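
A minimal, self-contained sketch of the proposed guard follows. It is not the 
actual HBase code or patch: ClusterConnection, QueueStorage and GuardedCleaner 
are hypothetical stand-ins for the master's connection, the replication queue 
storage and ReplicationLogCleaner, and the rule "files are deletable only when 
the queue has no data" is a placeholder policy chosen for illustration. The 
only point it shows is that both methods return before touching the storage 
once the connection reports closed.

```java
import java.util.Collections;
import java.util.List;

public class CleanerShutdownSketch {

  /** Hypothetical stand-in for the master's cluster connection. */
  interface ClusterConnection {
    boolean isClosed();
  }

  /** Hypothetical stand-in for the replication queue storage. */
  interface QueueStorage {
    boolean hasData() throws Exception;
  }

  /** Hypothetical cleaner that skips all work once the connection is closed. */
  static final class GuardedCleaner {
    private final ClusterConnection conn;
    private final QueueStorage storage;
    private boolean canDelete;

    GuardedCleaner(ClusterConnection conn, QueueStorage storage) {
      this.conn = conn;
      this.storage = storage;
    }

    /** Runs before each cleaning pass. */
    void preClean() {
      canDelete = false;
      if (conn.isClosed()) {
        // Master is stopping: do not issue an RPC that is bound to fail.
        return;
      }
      try {
        // Placeholder policy (assumption): deletable only if the queue is empty.
        canDelete = !storage.hasData();
      } catch (Exception e) {
        System.err.println("hasData() failed: " + e);
      }
    }

    /** Returns nothing to delete when the connection is closed. */
    List<String> getDeletableFiles(List<String> candidates) {
      if (conn.isClosed() || !canDelete) {
        return Collections.emptyList();
      }
      return candidates;
    }
  }

  public static void main(String[] args) {
    // Storage that fails the way a stopped RPC client would.
    QueueStorage failing = () -> {
      throw new IllegalStateException("rpc client stopped");
    };
    // Connection already closed: the failing storage is never called.
    GuardedCleaner cleaner = new GuardedCleaner(() -> true, failing);
    cleaner.preClean();
    System.out.println(cleaner.getDeletableFiles(List.of("wal-1")));
  }
}
```

Returning an empty list in the closed case is the conservative choice: no file 
is deleted on the basis of a check that could not be performed.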



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
