Rushabh Shah created HBASE-25612:
------------------------------------

             Summary: HMaster should abort if ReplicationLogCleaner is not able 
to delete oldWALs.
                 Key: HBASE-25612
                 URL: https://issues.apache.org/jira/browse/HBASE-25612
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 1.6.0
            Reporter: Rushabh Shah
            Assignee: Rushabh Shah


In our production cluster, we encountered an issue where the number of files 
in the /hbase/oldWALs directory was growing rapidly, from a baseline of about 
4,000 to 150,000, and still increasing at a rate of roughly 333 files per minute.

On further investigation we found that the ReplicationLogCleaner thread was 
being aborted because it was not able to talk to ZooKeeper. Stack trace below:
{noformat}
2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - 
replicationLogCleaner-0x3000002e05e0d8f, 
quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181,
 baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/replication/rs
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
 at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
 at 
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
 at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
 at 
org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
 at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
 at 
org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
 at 
org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)

2021-02-25 23:05:01,149 WARN  [an-pool3-thread-1729] 
master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring.  
Reason: Failed to get stat of replication rs node
2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] 
master.ReplicationLogCleaner - 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase/replication/rs
2021-02-25 23:05:01,150 WARN  [an-pool3-thread-1729] 
master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking 
deletable files
 {noformat}
 

{quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] 
master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. 
Reason: Failed to get stat of replication rs node
{quote}
 

This line is the scarier one: the Abortable was invoked but the abort was 
simply ignored, and the HMaster carried on with business as usual.

We have a max-files-per-directory limit configured on the NameNode, which is 
set to 1M in our clusters. If this directory had reached that limit, it would 
have brought down the whole cluster.
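To put the numbers above in perspective, here is a rough calculation of how long the cluster had before hitting the directory limit. The figures (150,000 files, 333 files/minute, 1M limit, corresponding to the HDFS {{dfs.namenode.fs-limits.max-directory-items}} setting) are taken from this report; the class name is just for illustration.

```java
public class OldWalsGrowthEstimate {
    public static void main(String[] args) {
        // Figures from the report: ~150,000 files already in /hbase/oldWALs,
        // growing at ~333 files/minute, against a NameNode per-directory
        // limit (dfs.namenode.fs-limits.max-directory-items) of 1,000,000.
        long current = 150_000L;
        long limit = 1_000_000L;
        long ratePerMinute = 333L;

        // Time remaining before the directory hits the NameNode limit.
        long minutesLeft = (limit - current) / ratePerMinute;
        double hoursLeft = minutesLeft / 60.0;

        System.out.printf("Minutes until directory limit: %d (~%.1f hours)%n",
                minutesLeft, hoursLeft);
    }
}
```

At that rate the cluster had under two days before every write into /hbase/oldWALs started failing.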

We shouldn't ignore the Abortable; the HMaster should crash if the Abortable 
is invoked.
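The proposed behavior can be sketched as follows. This is a minimal, self-contained model, not the actual HBase code: the interface mirrors {{org.apache.hadoop.hbase.Abortable}}, but {{Master}} and {{ReplicationLogCleanerSketch}} are simplified stand-ins. The idea is that instead of logging "received abort, ignoring", the cleaner delegates the abort to the master's Abortable so a ZooKeeper session expiry actually takes the master down.

```java
// Simplified model of the proposed fix. The Abortable interface mirrors
// org.apache.hadoop.hbase.Abortable; the classes are illustrative stand-ins.
interface Abortable {
    void abort(String why, Throwable e);
    boolean isAborted();
}

// Stand-in for HMaster: in the real master, abort() leads to process shutdown.
class Master implements Abortable {
    private volatile boolean aborted = false;

    @Override
    public void abort(String why, Throwable e) {
        aborted = true;
        System.err.println("Aborting master: " + why);
    }

    @Override
    public boolean isAborted() { return aborted; }
}

// Today the cleaner's abort() logs "received abort, ignoring". The sketch
// below instead propagates the abort to the master's Abortable.
class ReplicationLogCleanerSketch implements Abortable {
    private final Abortable master;

    ReplicationLogCleanerSketch(Abortable master) {
        this.master = master;
    }

    @Override
    public void abort(String why, Throwable e) {
        // Propagate upward instead of swallowing the abort.
        master.abort("ReplicationLogCleaner: " + why, e);
    }

    @Override
    public boolean isAborted() { return master.isAborted(); }
}

public class AbortPropagationDemo {
    public static void main(String[] args) {
        Master master = new Master();
        ReplicationLogCleanerSketch cleaner =
                new ReplicationLogCleanerSketch(master);
        cleaner.abort("Failed to get stat of replication rs node",
                new RuntimeException("KeeperErrorCode = Session expired"));
        System.out.println("master aborted: " + master.isAborted()); // prints true
    }
}
```

With this wiring, the session-expired failure seen in the stack trace above would abort the HMaster rather than leaving a dead cleaner silently accumulating oldWALs.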

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
