[ 
https://issues.apache.org/jira/browse/HBASE-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292528#comment-14292528
 ] 

churro morales commented on HBASE-10100:
----------------------------------------

Hi [~lhofhansl]

Yes, I believe that was the issue: we were having DNS issues, some of our 
regionservers received UnknownHostExceptions, and then the source shut down.  So 
about 200 machines out of roughly 1100 were not replicating for a while, and we 
only found this out weeks later when doing consistency checks.  

I am not very familiar with trunk, but looking at the code: in 
HBaseInterClusterReplicationEndpoint.init() we do the following:
this.conn = HConnectionManager.createConnection(this.conf);
and that connection is just passed off to the constructor of 
ReplicationSinkManager.

Instead, maybe we can lazily create the connection in ReplicationSinkManager and 
not bother with it in HBaseInterClusterReplicationEndpoint at all, since the 
endpoint doesn't really need it.  

That way ReplicationSourceManager.getReplicationSource() should not throw that 
exception. 
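
Roughly what I have in mind is something like the sketch below (just a sketch; the 
field and constructor shapes are approximate, not the actual trunk code): 
ReplicationSinkManager would take the Configuration instead of a live connection 
and only create the HConnection the first time it actually needs a sink, so a 
transient DNS failure surfaces inside the normal replication retry path instead of 
killing source creation.
{code}
public class ReplicationSinkManager {
  private final Configuration conf;
  private volatile HConnection conn;  // created on demand, not in init()

  public ReplicationSinkManager(Configuration conf /*, peerId, endpoint, ... */) {
    this.conf = conf;
  }

  private HConnection getConnection() throws IOException {
    HConnection result = conn;
    if (result == null) {
      synchronized (this) {
        if (conn == null) {
          // A DNS hiccup now fails here, inside replication's retry path,
          // rather than in ReplicationSourceManager.getReplicationSource().
          conn = HConnectionManager.createConnection(conf);
        }
        result = conn;
      }
    }
    return result;
  }
}
{code}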

[~stack] [~apurtell] what do you guys think?  I haven't had much time to 
look at the code and I'm still trying to figure out a way to write some tests 
to reproduce this.  Since I am not as familiar with this code path yet, I 
would love to hear your thoughts. 


> Hbase replication cluster can have varying peers under certain conditions
> -------------------------------------------------------------------------
>
>                 Key: HBASE-10100
>                 URL: https://issues.apache.org/jira/browse/HBASE-10100
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.94.5, 0.95.0, 0.96.0
>            Reporter: churro morales
>         Attachments: HBASE_10100-0.94.5.patch
>
>
> We were trying to replicate HBase data over to a new datacenter recently.  
> After we turned on replication and then ran our copy tables, we noticed that 
> verify replication had discrepancies.  
> We ran a list_peers and it returned both peers, the original datacenter we 
> were replicating to and the new datacenter (this was correct).  
> When grepping through the logs for a few regionservers we noticed that a few 
> regionservers had the following entry in their logs:
> 2013-09-26 10:55:46,907 ERROR 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: 
> Error while adding a new peer java.net.UnknownHostException: 
> xxx.xxx.flurry.com (this was due to a transient dns issue)
> Thus a very small subset of our regionservers were not replicating to this new 
> cluster while most were. 
> We probably don't want to abort if this type of issue comes up; it could 
> potentially be fatal if someone does an "add_peer" operation with a typo, 
> which could shut down the whole cluster. 
> One solution I can think of is keeping some flag in ReplicationSourceManager 
> which is a boolean that keeps track of whether there was an errorAddingPeer.  
> Then in the logPositionAndCleanOldLogs we can do something like:
> {code}
> if (errorAddingPeer) {
>   LOG.error("There was an error adding a peer, logs will not be marked for deletion");
>   return;
> }
> {code}
> Thus we are not deleting these logs from the queue.  You will notice your 
> replication queue rising on certain machines, but you can still replay the 
> logs, avoiding a lengthy copy table. 
> I have a patch (with unit test) for the above proposal, if everyone thinks 
> that is an okay solution.
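> For completeness, the flag-setting side would live in ReplicationSourceManager's 
> peer-added handling, roughly like this (simplified sketch; the exact method 
> names in 0.94 differ slightly, and errorAddingPeer is the proposed new field):
> {code}
> try {
>   addSource(id);
> } catch (IOException e) {
>   // Remember the failure so logPositionAndCleanOldLogs keeps the old logs around.
>   this.errorAddingPeer = true;
>   LOG.error("Error while adding a new peer", e);
> }
> {code}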
> An additional idea would be to add some retry logic inside the PeersWatcher 
> class for the nodeChildrenChanged method.  That way, if there happens to be a 
> transient issue, we could sort it out without having to bounce that particular 
> regionserver, as in the sketch below.  
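> A rough sketch of that retry (the names and structure are illustrative only, 
> not the actual PeersWatcher code):
> {code}
> boolean added = false;
> for (int attempt = 0; attempt < maxRetries && !added; attempt++) {
>   try {
>     addSource(id);          // may fail on a transient DNS hiccup
>     added = true;
>   } catch (IOException e) {
>     LOG.warn("Attempt " + attempt + " to add peer " + id + " failed, will retry", e);
>     try {
>       Thread.sleep(sleepBeforeRetry);
>     } catch (InterruptedException ie) {
>       Thread.currentThread().interrupt();
>       break;
>     }
>   }
> }
> if (!added) {
>   LOG.error("Giving up on adding peer " + id + "; its logs should be retained");
> }
> {code}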
> Would love to hear everyone's thoughts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
