[ 
https://issues.apache.org/jira/browse/HBASE-29376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guluo resolved HBASE-29376.
---------------------------
    Fix Version/s: 3.0.0-beta-2
       Resolution: Fixed

> ReplicationLogCleaner.preClean/getDeletableFiles should return early when 
> asyncClusterConnection closes during HMaster stopping
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29376
>                 URL: https://issues.apache.org/jira/browse/HBASE-29376
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, Replication
>         Environment: HBase master
>            Reporter: guluo
>            Assignee: guluo
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0-beta-2
>
>
> While HMaster is stopping, HBase logs a large number of exceptions like the 
> one below (observed with hbase.master.cleaner.interval = 10000 ms; a smaller 
> interval also reproduces it):
> 2025-06-04T20:49:37,614 ERROR [master/hbase001:16000.Chore.2] 
> master.ReplicationLogCleaner: Error occurred while executing 
> queueStorage.hasData()
> org.apache.hadoop.hbase.replication.ReplicationException: failed to get 
> replication queue table
>         at 
> org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:538)
>  ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
>  ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
>         at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.preRunCleaner(CleanerChore.java:282)
>  ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:257)
>  ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161) 
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> ~[?:?]
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>  ~[?:?]
>         at 
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
>  ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>  ~[?:?]
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>  ~[?:?]
>         at java.lang.Thread.run(Thread.java:833) ~[?:?]
> Caused by: org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to 
> address=hbase001:16020 failed on local exception: 
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException
>         at java.lang.Thread.getStackTrace(Thread.java:1610) ~[?:?]
>         at 
> org.apache.hadoop.hbase.util.FutureUtils.setStackTrace(FutureUtils.java:144) 
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.util.FutureUtils.rethrow(FutureUtils.java:163) 
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:186) 
> ~[hbase-common-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.client.AdminOverAsyncAdmin.tableExists(AdminOverAsyncAdmin.java:130)
>  ~[hbase-client-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.replication.TableReplicationQueueStorage.hasData(TableReplicationQueueStorage.java:536)
>  ~[hbase-replication-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.preClean(ReplicationLogCleaner.java:86)
>  ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>         at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
>  
> Cause:
> When the HMaster service enters its stopping phase, the ReplicationLogCleaner 
> task continues to execute periodically. During these executions, it invokes 
> the rpm.getQueueStorage().hasData() method to check for the existence of 
> pending data in the replication queue.
> However, once the HMaster service closes its asyncClusterConnection, we can 
> no longer properly retrieve replication queue data because the underlying RPC 
> client has been shut down at that point.
> So I think ReplicationLogCleaner should check whether 
> HMaster.asyncClusterConnection is closed and return early if it is, to 
> ensure a graceful shutdown of HMaster.
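
The early-return check proposed above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: ClusterConnection, QueueStorage, Cleaner, the canFilter flag, and the errorsLogged counter are hypothetical stand-ins for the HBase classes named in the description, and the real filtering logic is omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not the real HBase API. */
final class ReplicationLogCleanerSketch {

  /** Stand-in for the master's cluster connection. */
  interface ClusterConnection {
    boolean isClosed();
  }

  /** Stand-in for the replication queue storage; hasData() fails once the RPC client is stopped. */
  interface QueueStorage {
    boolean hasData() throws Exception;
  }

  static final class Cleaner {
    private final ClusterConnection conn;
    private final QueueStorage storage;
    private boolean canFilter;
    int errorsLogged; // counts the ERROR lines that would have been written

    Cleaner(ClusterConnection conn, QueueStorage storage) {
      this.conn = conn;
      this.storage = storage;
    }

    /** Runs before each cleaning pass. */
    void preClean() {
      if (conn.isClosed()) {
        // Master is stopping: skip the storage lookup instead of failing and logging.
        canFilter = false;
        return;
      }
      try {
        storage.hasData();
        canFilter = true;
      } catch (Exception e) {
        errorsLogged++; // the real cleaner logs an ERROR with the stack trace here
        canFilter = false;
      }
    }

    /** Returns the files that may be deleted; nothing is deletable once the connection is closed. */
    List<String> getDeletableFiles(List<String> files) {
      if (conn.isClosed() || !canFilter) {
        return List.of();
      }
      // Real filtering against the replication queues is omitted in this sketch.
      return new ArrayList<>(files);
    }
  }
}
```

With a closed connection, preClean() returns before touching the queue storage, so no exception is raised or logged, and getDeletableFiles() conservatively reports no deletable files.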



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
