[
https://issues.apache.org/jira/browse/HBASE-25612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340177#comment-17340177
]
Rushabh Shah edited comment on HBASE-25612 at 5/6/21, 12:17 PM:
----------------------------------------------------------------
> Currently aborting master sounds aggressive, when backup master becomes
> active will it be aborted again?
We follow the same behavior in master and branch-2 as well. With this patch we
share the HMaster's zookeeper session instead of creating a new one just for
the ReplicationLogCleaner thread.
On one of our biggest clusters we see this issue every 15-20 days. Due to a
transient zookeeper issue, the ReplicationLogCleaner thread loses its zk
connection. We have alerts in place that notify us when the number of files in
the oldWALs directory reaches a certain threshold; we then manually fail over
the hmaster, after which oldWALs directory cleanup resumes.
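As a rough sketch of what the patch does (with illustrative stand-in types, not the actual HBase classes): the cleaner is handed the master's existing zookeeper session rather than constructing a private one, so the cleaner can no longer lose a session the master still considers healthy:

```java
// Simplified model of the session-sharing change. ZkSession, Master and
// LogCleaner are illustrative stand-ins, not the real HBase classes.
import java.util.Objects;

class ZkSession {
    final String id;
    ZkSession(String id) { this.id = id; }
}

class Master {
    private final ZkSession session = new ZkSession("master-session");
    ZkSession zkSession() { return session; }
}

class LogCleaner {
    private final ZkSession zk;

    // Before: the cleaner did the equivalent of `new ZkSession("cleaner")`,
    // giving it a private session that could expire independently of the
    // master's. After: the master's session is injected and shared.
    LogCleaner(ZkSession sharedWithMaster) {
        this.zk = Objects.requireNonNull(sharedWithMaster);
    }

    boolean sharesSessionWith(Master m) {
        return zk == m.zkSession();
    }
}
```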
was (Author: shahrs87):
> Currently aborting master sounds aggressive, when backup master becomes
> active will it be aborted again?
We follow the same behavior in master and branch-2 as well. With this patch we
share the HMaster's zookeeper session instead of creating a new one just for
the ReplicationLogCleaner thread.
> HMaster should abort if ReplicationLogCleaner is not able to delete oldWALs.
> ----------------------------------------------------------------------------
>
> Key: HBASE-25612
> URL: https://issues.apache.org/jira/browse/HBASE-25612
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 1.6.0
> Reporter: Rushabh Shah
> Assignee: Rushabh Shah
> Priority: Major
> Fix For: 1.7.0
>
>
> In our production cluster, we encountered an issue where the number of files
> in the /hbase/oldWALs directory was growing exponentially, from a baseline of
> about 4000 to 150000, at a rate of 333 files per minute.
> On further investigation we found that the ReplicationLogCleaner thread was
> getting aborted because it was unable to talk to zookeeper. Stack trace below:
> {noformat}
> {noformat}
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] zookeeper.ZKUtil - replicationLogCleaner-0x3000002e05e0d8f, quorum=zookeeper-0:2181,zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181,zookeeper-4:2181, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229)
>   at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:374)
>   at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
>   at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:87)
>   at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
>   at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:262)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$200(CleanerChore.java:52)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:413)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$3.act(CleanerChore.java:410)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.deleteAction(CleanerChore.java:481)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.traverseAndDelete(CleanerChore.java:410)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore.access$100(CleanerChore.java:52)
>   at org.apache.hadoop.hbase.master.cleaner.CleanerChore$1.run(CleanerChore.java:220)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
> 2021-02-25 23:05:01,149 DEBUG [an-pool3-thread-1729] master.ReplicationLogCleaner - org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/replication/rs
> 2021-02-25 23:05:01,150 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - Failed to read zookeeper, skipping checking deletable files
> {noformat}
>
> {quote} 2021-02-25 23:05:01,149 WARN [an-pool3-thread-1729] master.ReplicationLogCleaner - ReplicationLogCleaner received abort, ignoring. Reason: Failed to get stat of replication rs node
> {quote}
>
> This line is the scarier part: the HMaster's Abortable was invoked, but the
> abort was simply ignored and the HMaster carried on with business as usual.
> We have a max-files-per-directory limit configured on the namenode, set to 1M
> in our clusters. If this directory had reached that limit, it would have
> brought down the whole cluster.
> We shouldn't ignore the Abortable; we should crash the HMaster whenever the
> Abortable is invoked.
>
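A minimal sketch of the behavior the issue asks for: the Abortable callback actually taking the master down instead of being logged and ignored. The `Abortable` interface here mirrors the shape of HBase's interface, but `MasterStub` and `CleanerStub` are hypothetical stand-ins for illustration only:

```java
// Sketch of abort escalation. Abortable mirrors the shape of HBase's
// interface; MasterStub and CleanerStub are hypothetical stand-ins.
interface Abortable {
    void abort(String why, Throwable cause);
    boolean isAborted();
}

class MasterStub implements Abortable {
    private volatile boolean aborted = false;

    @Override
    public void abort(String why, Throwable cause) {
        // Proposed behavior: record the abort and begin shutdown, instead
        // of the "received abort, ignoring" seen in the log above.
        aborted = true;
    }

    @Override
    public boolean isAborted() { return aborted; }
}

class CleanerStub {
    private final Abortable master;
    CleanerStub(Abortable master) { this.master = master; }

    void onZkSessionExpired(Exception cause) {
        // A cleaner that cannot reach zookeeper can no longer tell which
        // oldWALs are safe to delete, so escalate rather than swallow.
        master.abort("ReplicationLogCleaner lost ZooKeeper session", cause);
    }
}
```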
--
This message was sent by Atlassian Jira
(v8.3.4#803005)