[ 
https://issues.apache.org/jira/browse/HBASE-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-14476.
-------------------------------
    Resolution: Duplicate

Fixed by HBASE-16135.

> ReplicationQueuesZKImpl#copyQueuesFromRSUsingMulti will fail if there are 
> orphaned queues under dead region server
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-14476
>                 URL: https://issues.apache.org/jira/browse/HBASE-14476
>             Project: HBase
>          Issue Type: Improvement
>          Components: Replication
>    Affects Versions: 2.0.0
>            Reporter: Jianwei Cui
>            Priority: Minor
>         Attachments: HBASE-14476-trunk-v1.patch
>
>
> ReplicationQueuesZKImpl#copyQueuesFromRSUsingMulti won't move the orphaned 
> queues under dead region 
> server([HBASE-12769|https://issues.apache.org/jira/browse/HBASE-12769] 
> describes situations orphaned queues tend to happen):
> {code}
>         if (!peerExists(replicationQueueInfo.getPeerId())) {
>           LOG.warn("Peer " + peerId + " didn't exist, skipping the replay");
>           // Protection against moving orphaned queues
>           continue;
>         }
> {code}
> After processing all the queues, the rsNode of dead region server will also 
> be deleted:
> {code}
>       // add delete op for dead rs, this will update the cversion of the 
> parent.
>       // The reader will make optimistic locking with this to get a consistent
>       // snapshot
>       listOfOps.add(ZKUtilOp.deleteNodeFailSilent(deadRSZnodePath));
>       ...
>       ZKUtil.multiOrSequential(this.zookeeper, listOfOps, false);
> {code} 
> If there are orphaned queues, the rsNode of dead region server is not empty, 
> so that the whole multi zookeeper operation will fail:
> {code}
> 2015-09-23 20:17:55,170 WARN  [ReplicationExecutor-0] 
> replication.ReplicationQueuesZKImpl: Got exception in 
> copyQueuesFromRSUsingMulti:
> org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode = 
> Directory not empty
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:125)
>   at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
>   at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
> {code}
> This fail will make the normal queues under the dead region server can not be 
> transferred if any orphaned queue exist. 
> In [HBASE-12865|https://issues.apache.org/jira/browse/HBASE-12865], 
> ReplicationLogCleaner will depend the cversion change of rsNode parent to 
> clean the WALs. Therefore, a possible solution is also transferring orphaned 
> queues from dead region server. These orphaned queues will be skipped in 
> ReplicationSourceManager$NodeFailoverWorker#run:
> {code}
>           try {
>             peerConfig = 
> replicationPeers.getReplicationPeerConfig(actualPeerId);
>           } catch (ReplicationException ex) {
>             LOG.warn("Received exception while getting replication peer 
> config, skipping replay"
>                 + ex);
>           }
>           if (peer == null || peerConfig == null) {
>             LOG.warn("Skipping failover for peer:" + actualPeerId + " of 
> node" + rsZnode);
>             continue;
>           }
> {code}
> This will make the orphaned queues also be kept in zookeeper with the queue 
> name containing the transfer histories(waiting for manual operation), and the 
> normal queues under the dead region server can also be processed. Suggestion 
> and discussion are welcomed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to