[ 
https://issues.apache.org/jira/browse/HBASE-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842197#comment-13842197
 ] 

Jean-Marc Spaggiari commented on HBASE-10100:
---------------------------------------------

[~jdcryans] probably something you want to look at...

> Hbase replication cluster can have varying peers under certain conditions
> -------------------------------------------------------------------------
>
>                 Key: HBASE-10100
>                 URL: https://issues.apache.org/jira/browse/HBASE-10100
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5, 0.95.0, 0.96.0
>            Reporter: churro morales
>
> We were recently trying to replicate HBase data over to a new datacenter.
> After we turned on replication and ran our copy tables, we noticed that
> verify replication reported discrepancies.
> We ran a list_peers and it returned both peers, the original datacenter
> we were replicating to and the new datacenter (this was correct).
> When grepping through the logs for a few regionservers we noticed that a few 
> regionservers had the following entry in their logs:
> 2013-09-26 10:55:46,907 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Error while adding a new peer java.net.UnknownHostException: 
> xxx.xxx.flurry.com (this was due to a transient dns issue)
> Thus a very small subset of our regionservers were not replicating to this new
> cluster while most were.
> We probably don't want to abort if this type of issue comes up. Aborting could
> potentially be fatal if someone does an "add_peer" operation with a typo,
> since it could shut down the cluster.
> One solution I can think of is keeping a boolean flag in
> ReplicationSourceManager, errorAddingPeer, that tracks whether adding a peer
> failed. Then in logPositionAndCleanOldLogs we can do something like:
> {code}
> if (errorAddingPeer) {
>   LOG.error("There was an error adding a peer, logs will not be marked for deletion");
>   return;
> }
> {code}
> This way we do not delete these logs from the queue.  You will notice the
> replication queue growing on the affected machines, and you can still replay
> the logs, thus avoiding a lengthy copy table.
> I have a patch (with unit test) for the above proposal, if everyone thinks 
> that is an okay solution.
> An additional idea would be to add some retry logic inside the PeersWatcher 
> class for the nodeChildrenChanged method.  Thus if there happens to be some 
> issue we could sort it out without having to bounce that particular 
> regionserver.  
> Would love to hear everyone's thoughts.
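
The two ideas in the description (an errorAddingPeer flag that blocks log cleanup, plus retrying a failed peer add) can be sketched together as below. This is only an illustration under stated assumptions, not the attached patch and not the real ReplicationSourceManager or PeersWatcher code: PeerConnector, PeerTrackingSketch, maxAttempts, retryDelayMs and the in-memory log queue are all hypothetical names made up for the example.

```java
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for whatever resolves and connects to a peer cluster.
interface PeerConnector {
    void connect(String peerId) throws UnknownHostException;
}

// Simplified model of the proposal; it does not use any HBase classes.
class PeerTrackingSketch {
    private final PeerConnector connector;
    private final int maxAttempts;      // assumed retry budget
    private final long retryDelayMs;    // assumed fixed delay between attempts
    private final List<String> queuedLogs = new ArrayList<>();
    private volatile boolean errorAddingPeer = false;

    PeerTrackingSketch(PeerConnector connector, int maxAttempts, long retryDelayMs) {
        this.connector = connector;
        this.maxAttempts = maxAttempts;
        this.retryDelayMs = retryDelayMs;
    }

    // Retry idea: a transient failure (e.g. DNS) gets several attempts.
    // Only when every attempt fails is the flag set; the server is not aborted.
    boolean addPeer(String peerId) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                connector.connect(peerId);
                return true;
            } catch (UnknownHostException e) {
                if (attempt == maxAttempts) {
                    break;
                }
                try {
                    Thread.sleep(retryDelayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    break;
                }
            }
        }
        // When (or whether) to clear this flag again is left open here.
        errorAddingPeer = true;
        return false;
    }

    void enqueueLog(String logName) {
        queuedLogs.add(logName);
    }

    // Flag idea: same guard as in the description. While a peer add has
    // failed, no log is dropped, so the logs can still be replayed later.
    void logPositionAndCleanOldLogs(String currentLog) {
        if (errorAddingPeer) {
            System.err.println(
                "There was an error adding a peer, logs will not be marked for deletion");
            return;
        }
        int idx = queuedLogs.indexOf(currentLog);
        if (idx > 0) {
            queuedLogs.subList(0, idx).clear(); // drop logs older than the current one
        }
    }

    int queuedLogCount() {
        return queuedLogs.size();
    }

    boolean hasErrorAddingPeer() {
        return errorAddingPeer;
    }
}
```

In this sketch a connector that fails twice and then succeeds leaves the flag unset and cleanup proceeds as usual, while a connector that always fails sets the flag and the queue keeps growing, which matches the behaviour the description asks for.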



--
This message was sent by Atlassian JIRA
(v6.1#6144)
