[ 
https://issues.apache.org/jira/browse/HBASE-25612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298843#comment-17298843
 ] 

Rushabh Shah commented on HBASE-25612:
--------------------------------------

> ReplicationLogCleaner uses its own zk client session?

In master and branch-2, 
[ReplicationLogCleaner|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java#L102-L107]
 shares the zk session with the HMaster.

But in branch-1, 
[ReplicationLogCleaner|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/master/ReplicationLogCleaner.java#L139]
 creates its own ZooKeeperWatcher object, i.e. its own zk session. 

[~anoop.hbase] We observed this behavior in a cluster running hbase-1.3. It looks 
like this bug exists only in branch-1.
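
For reference, a rough sketch of the two code paths (paraphrased from the linked files, so exact names and lines may differ between revisions):

{noformat}
// branch-1 ReplicationLogCleaner#setConf(Configuration) (paraphrased):
// the cleaner builds its own ZooKeeperWatcher, i.e. its own zk session,
// so a session expiry here does not touch the HMaster's session.
@Override
public void setConf(Configuration config) {
  // Make my own Configuration, then my own connection to zk.
  Configuration conf = new Configuration(config);
  try {
    setConf(conf, new ZooKeeperWatcher(conf, "replicationLogCleaner", null));
  } catch (IOException e) {
    LOG.error("Error while configuring " + this.getClass().getName(), e);
  }
}

// master / branch-2 ReplicationLogCleaner#setConf(Configuration, ZKWatcher) (paraphrased):
// the watcher is handed in by the HMaster, so the cleaner shares the master's zk session.
@Override
public void setConf(Configuration conf, ZKWatcher zk) {
  super.setConf(conf);
  try {
    this.zkw = zk; // shared with the HMaster
    this.queueStorage = ReplicationStorageFactory.getReplicationQueueStorage(zk, conf);
  } catch (Exception e) {
    LOG.error("Error while configuring " + this.getClass().getName(), e);
  }
}
{noformat}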

> HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-25612
>                 URL: https://issues.apache.org/jira/browse/HBASE-25612
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 1.6.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> In our production cluster, we encountered an issue where the number of files 
> in the /hbase/oldWALs directory grew from a baseline of about 4,000 to 150,000, 
> and was increasing at a rate of roughly 333 files per minute.
> On further investigation we found that the ReplicationLogCleaner thread was 
> getting aborted because it was not able to talk to zookeeper. Stack trace below:
> {noformat}
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - 
> replicationLogCleaner-0x3000002e05e0d8f, 
> quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181,
>  baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /hbase/replication/rs
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>  at 
> org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
>  at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>  at 
> org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
>  at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>  at 
> org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
>  at 
> org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2021-02-25 23:05:01,149 WARN  [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - ReplicationLogCleaner received abort, 
> ignoring.  Reason: Failed to get stat of replication rs node
> 2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - 
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /hbase/replication/rs
> 2021-02-25 23:05:01,150 WARN  [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking 
> deletable files
>  {noformat}
>  
> {quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] 
> master.ReplicationLogCleaner - ReplicationLogCleaner received abort, 
> ignoring. Reason: Failed to get stat of replication rs node
> {quote}
>  
> This line is the more worrying one: an Abortable was invoked inside the HMaster 
> but simply ignored, and the HMaster carried on with business as usual.
> We have the max-files-per-directory limit in the namenode set to 1M in our 
> clusters. If the oldWALs directory had reached that limit, it would have brought 
> down the whole cluster.
> We shouldn't ignore the Abortable; we should crash the HMaster when it is 
> invoked.
>  
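
For context, the abort in the quoted log above is effectively a no-op in branch-1's ReplicationLogCleaner. A rough sketch of that behaviour, and of one possible direction for the improvement proposed here (the {{master}} field below is hypothetical, not existing code):

{noformat}
// branch-1 behaviour behind the "received abort, ignoring" line (paraphrased):
@Override
public void abort(String why, Throwable e) {
  LOG.warn("ReplicationLogCleaner received abort, ignoring.  Reason: " + why);
}

// Sketch of the proposed improvement: escalate to the HMaster instead of swallowing.
// 'master' is a hypothetical Abortable reference injected when the cleaner is created.
private Abortable master;

@Override
public void abort(String why, Throwable e) {
  LOG.error("ReplicationLogCleaner aborting, escalating to HMaster. Reason: " + why, e);
  if (master != null) {
    master.abort(why, e); // let the HMaster abort rather than silently let oldWALs pile up
  }
}
{noformat}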



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
