[ https://issues.apache.org/jira/browse/HBASE-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13842197#comment-13842197 ]
Jean-Marc Spaggiari commented on HBASE-10100:
---------------------------------------------

[~jdcryans] probably something you want to look at...

> Hbase replication cluster can have varying peers under certain conditions
> -------------------------------------------------------------------------
>
>                 Key: HBASE-10100
>                 URL: https://issues.apache.org/jira/browse/HBASE-10100
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5, 0.95.0, 0.96.0
>            Reporter: churro morales
>
> We were trying to replicate HBase data over to a new datacenter recently.
> After we turned on replication, we ran our copy tables and noticed that
> verify replication reported discrepancies.
> Running list_peers returned both peers: the original datacenter we were
> replicating to and the new datacenter (this was correct).
> When grepping through the logs of a few regionservers, we noticed that some
> of them had the following entry:
> 2013-09-26 10:55:46,907 ERROR
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Error while adding a new peer
> java.net.UnknownHostException: xxx.xxx.flurry.com
> (this was due to a transient DNS issue)
> Thus a very small subset of our regionservers were not replicating to this
> new cluster while most were.
> We probably don't want to abort if this type of issue comes up; aborting
> could be fatal if someone does an "add_peer" operation with a typo, which
> could potentially shut down the cluster.
> One solution I can think of is keeping a boolean flag in
> ReplicationSourceManager that tracks whether there was an errorAddingPeer.
> Then in logPositionAndCleanOldLogs we can do something like:
> {code}
> if (errorAddingPeer) {
>   LOG.error("There was an error adding a peer, logs will not be marked "
>       + "for deletion");
>   return;
> }
> {code}
> thus we are not deleting these logs from the queue.
> You will notice your replication queue rising on certain machines, and you
> can still replay the logs, thus avoiding a lengthy copy table.
> I have a patch (with unit test) for the above proposal, if everyone thinks
> that is an okay solution.
> An additional idea would be to add some retry logic inside the PeersWatcher
> class for the nodeChildrenChanged method. Then, if there happens to be some
> transient issue, we could sort it out without having to bounce that
> particular regionserver.
> Would love to hear everyone's thoughts.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
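The retry idea mentioned for nodeChildrenChanged could be sketched as a small generic helper that retries a peer-connection attempt with backoff before giving up. This is only an illustrative sketch: the class name PeerRetry, the withRetries signature, and the linear backoff policy are assumptions, not part of the HBase codebase or the attached patch.

```java
import java.util.concurrent.Callable;

// Illustrative sketch (not HBase API): retry a task that may fail
// transiently, e.g. adding a peer that throws UnknownHostException
// because of a temporary DNS problem.
public class PeerRetry {

  public static <T> T withRetries(Callable<T> task, int maxAttempts, long backoffMs)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return task.call(); // succeed on any attempt
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(backoffMs * attempt); // simple linear backoff
        }
      }
    }
    throw last; // exhausted retries: surface the last failure
  }
}
```

With something like this wrapped around the peer-add path, a transient DNS hiccup would be absorbed by a later attempt instead of leaving the regionserver permanently missing one peer; a genuinely bad peer (e.g. an add_peer typo) would still fail after the final attempt rather than aborting the server outright.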