[jira] [Commented] (HBASE-12386) Replication gets stuck following a transient zookeeper error to remote peer cluster

Adrian Muraru (JIRA) Thu, 30 Oct 2014 13:28:01 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-12386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190767#comment-14190767
 ]


Adrian Muraru commented on HBASE-12386:
---------------------------------------

Looking at the code, it seems that once the lookup of the remote ZK peers 
fails, the refresh timestamp is still updated and the returned list of RS 
peers is empty.

On the next poll, 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSinkManager does 
not retry the lookup, because the following condition is not met:
{code:java}
if (endpoint.getLastRegionServerUpdate() > this.lastUpdateToPeers) {
  LOG.info("Current list of sinks is out of date, updating");
  chooseSinks();
}
{code}

A fix would be to force a refresh when the list of peers is empty:
{code:java}
if (replicationPeers.getTimestampOfLastChangeToPeer(peerClusterId) > this.lastUpdateToPeers
    || sinks.isEmpty()) {
  LOG.info("Current list of sinks is out of date or empty, updating");
  chooseSinks();
}
{code}
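
To make the stuck state easier to see, here is a small self-contained toy 
model of the polling logic. It is only a sketch and not the HBase code: the 
class SinkRefreshSketch, its fields, the logical clock and the hard-coded 
"rs1"/"rs2" servers are all made up for illustration. The only parts taken 
from the discussion above are the shape of the condition and the assumption 
that the update timestamp advances even when the lookup returns nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model of the polling logic discussed above; NOT the HBase implementation.
class SinkRefreshSketch {
    private final boolean refreshWhenEmpty; // true = apply the proposed fix
    private boolean lookupFails;            // simulates the transient ZK error
    private long clock = 1;                 // logical clock instead of wall time
    private long lastPeerChange = 1;        // last change to the peer's RS list
    private long lastUpdateToPeers = 0;     // last time sinks were chosen
    private List<String> sinks = new ArrayList<>();

    SinkRefreshSketch(boolean refreshWhenEmpty) {
        this.refreshWhenEmpty = refreshWhenEmpty;
    }

    // Hypothetical remote lookup: returns an empty list while the error lasts.
    private List<String> lookupRegionServers() {
        if (lookupFails) {
            return Collections.emptyList();
        }
        return Arrays.asList("rs1", "rs2");
    }

    private void chooseSinks() {
        sinks = new ArrayList<>(lookupRegionServers());
        lastUpdateToPeers = ++clock; // advanced even if the lookup returned nothing
    }

    // One polling round, mirroring the condition quoted in the comment.
    void poll() {
        boolean outOfDate = lastPeerChange > lastUpdateToPeers;
        if (outOfDate || (refreshWhenEmpty && sinks.isEmpty())) {
            chooseSinks();
        }
    }

    // One poll during the error, one poll after recovery; returns the sink count.
    static int sinksAfterTransientFailure(boolean refreshWhenEmpty) {
        SinkRefreshSketch m = new SinkRefreshSketch(refreshWhenEmpty);
        m.lookupFails = true;
        m.poll();
        m.lookupFails = false;
        m.poll();
        return m.sinks.size();
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + sinksAfterTransientFailure(false));
        System.out.println("with fix:    " + sinksAfterTransientFailure(true));
    }
}
```

In this model the run without the extra isEmpty() check ends with zero sinks 
even though the lookup has recovered, while the run with the check picks up 
both servers on the next poll.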

Note that this does not reproduce on 0.94, where the refresh does seem to 
happen in this case.


> Replication gets stuck following a transient zookeeper error to remote peer 
> cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-12386
>                 URL: https://issues.apache.org/jira/browse/HBASE-12386
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.98.7
>            Reporter: Adrian Muraru
>
> Following a transient ZK error, replication gets stuck and remote peers are 
> never updated.
> Source region servers continuously report the following error in their logs:
> "No replication sinks are available"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
