Andrew Purtell created HBASE-11922:
--------------------------------------
Summary: copyQueuesFromRSUsingMulti may fail to clean up properly
if zk.useMulti is true and there are orphaned queues
Key: HBASE-11922
URL: https://issues.apache.org/jira/browse/HBASE-11922
Project: HBase
Issue Type: Bug
Affects Versions: 0.98.6
Reporter: Andrew Purtell
Priority: Minor
To reproduce: set hbase.zookeeper.useMulti to true in the site configuration,
start up an all-localhost cluster, create a table with a column family whose
replication scope is 1, add a peer (it doesn't have to be a live endpoint),
remove the peer, then restart the regionserver. Observe:
{noformat}
2014-09-09 13:39:23,497 WARN [ReplicationExecutor-0] replication.ReplicationQueuesZKImpl: Got exception in copyQueuesFromRSUsingMulti:
org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = Directory not empty
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:620)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1530)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.copyQueuesFromRSUsingMulti(ReplicationQueuesZKImpl.java:335)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.claimQueues(ReplicationQueuesZKImpl.java:167)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:520)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
{noformat}
This happens because there was an orphaned queue.
If ZK rolls back state after a failed multi (need to check, but let's assume so
for now), then the other ops bundled into the multi-op by
copyQueuesFromRSUsingMulti will be rolled back as well, which might not be what
we want.
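To make the rollback concern concrete, here is a minimal sketch of an all-or-nothing batch against a toy in-memory znode tree. It is not the ZooKeeper client API and not the HBase code: the class MultiSketch, its Op type, and the znode paths are all hypothetical, and the all-or-nothing behavior is simply the assumption stated above, modeled by applying ops to a scratch copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy in-memory model of an all-or-nothing batch of znode operations.
// This is NOT the ZooKeeper client API; every name here is hypothetical.
final class MultiSketch {

    /** One batched operation: create or delete a single path. */
    static final class Op {
        final boolean create;
        final String path;
        private Op(boolean create, String path) { this.create = create; this.path = path; }
        static Op create(String path) { return new Op(true, path); }
        static Op delete(String path) { return new Op(false, path); }
    }

    private TreeSet<String> nodes = new TreeSet<>();

    MultiSketch(String... initialPaths) { nodes.addAll(Arrays.asList(initialPaths)); }

    boolean exists(String path) { return nodes.contains(path); }

    private static boolean hasChildren(TreeSet<String> tree, String path) {
        String prefix = path + "/";
        String next = tree.ceiling(prefix); // first path sorting at or after "path/"
        return next != null && next.startsWith(prefix);
    }

    /** Applies ops to a scratch copy and commits only if every op succeeds. */
    boolean multi(List<Op> ops) {
        TreeSet<String> scratch = new TreeSet<>(nodes);
        for (Op op : ops) {
            if (op.create) {
                if (!scratch.add(op.path)) return false;          // node already exists
            } else {
                if (!scratch.contains(op.path)) return false;     // no such node
                if (hasChildren(scratch, op.path)) return false;  // "Directory not empty"
                scratch.remove(op.path);
            }
        }
        nodes = scratch; // all ops succeeded, so publish the new state
        return true;
    }

    public static void main(String[] args) {
        // Dead server "rs1" holds a queue for live peer "1" and an orphaned
        // queue for removed peer "2" (assumed layout, for illustration only).
        MultiSketch zk = new MultiSketch("/rs", "/rs/rs1", "/rs/rs1/1", "/rs/rs1/2", "/rs/rs2");
        List<Op> claim = new ArrayList<>();
        claim.add(Op.create("/rs/rs2/1-rs1")); // copy the live queue to the claiming server
        claim.add(Op.delete("/rs/rs1/1"));     // remove the original queue
        claim.add(Op.delete("/rs/rs1"));       // fails: orphaned "/rs/rs1/2" is still a child
        System.out.println("multi succeeded: " + zk.multi(claim));
        System.out.println("copied queue exists: " + zk.exists("/rs/rs2/1-rs1"));
        System.out.println("original queue exists: " + zk.exists("/rs/rs1/1"));
    }
}
```

In this toy model the final delete fails because the orphaned queue is still a child of the dead server's node, so the copy of the live queue is discarded along with it. Adding a delete for the orphaned queue to the same batch lets the whole batch commit.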