churro morales created HBASE-10100:
--------------------------------------
Summary: Hbase replication cluster can have varying peers under
certain conditions
Key: HBASE-10100
URL: https://issues.apache.org/jira/browse/HBASE-10100
Project: HBase
Issue Type: Bug
Affects Versions: 0.96.0, 0.95.0, 0.94.5
Reporter: churro morales
We were recently trying to replicate hbase data over to a new datacenter.
After we turned on replication and ran our copy tables, we noticed that
verify replication reported discrepancies.
We ran list_peers and it returned both peers: the original datacenter we
were replicating to and the new datacenter (this was correct).
When grepping through the logs of several regionservers, we noticed that a few
of them had the following entry:
2013-09-26 10:55:46,907 ERROR
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Error while adding a new peer java.net.UnknownHostException: xxx.xxx.flurry.com
(this was due to a transient dns issue)
Thus a very small subset of our regionservers were not replicating to this new
cluster while most were.
We probably don't want to abort the regionserver when this type of issue comes
up. Aborting could be fatal: if someone runs an "add_peer" operation with a
typo, it could shut down the whole cluster.
One solution I can think of is keeping a boolean flag in
ReplicationSourceManager that tracks whether there was an error adding a peer
(errorAddingPeer). Then in logPositionAndCleanOldLogs we can do something like:
{code}
if (errorAddingPeer) {
  LOG.error("There was an error adding a peer, logs will not be marked for deletion");
  return;
}
{code}
This way we do not delete these logs from the queue. You will notice the
replication queue growing on the affected machines, and you can still replay
the logs, thus avoiding a lengthy copy table.
I have a patch (with unit test) for the above proposal, if everyone thinks that
is an okay solution.
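To make the flag proposal above concrete, here is a minimal, self-contained
sketch. It is not the actual patch and does not use real HBase classes:
LogCleanupSketch, its methods, and the "wal.N" log names are hypothetical
stand-ins for the relevant part of ReplicationSourceManager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical stand-in for the log-cleaning part of ReplicationSourceManager.
class LogCleanupSketch {
    // Set when adding a peer failed on this regionserver.
    private boolean errorAddingPeer = false;
    // Names of queued logs, sorted oldest first (assumes names sort by age).
    private final TreeSet<String> queuedLogs = new TreeSet<>();

    synchronized void enqueueLog(String logName) {
        queuedLogs.add(logName);
    }

    synchronized void markPeerAddFailed() {
        errorAddingPeer = true;
    }

    synchronized void markPeerAddRecovered() {
        errorAddingPeer = false;
    }

    // Removes queued logs strictly older than currentLog and returns their
    // names. While errorAddingPeer is set, nothing is removed.
    synchronized List<String> cleanOldLogs(String currentLog) {
        List<String> removed = new ArrayList<>();
        if (errorAddingPeer) {
            System.err.println(
                "There was an error adding a peer, logs will not be marked for deletion");
            return removed;
        }
        removed.addAll(queuedLogs.headSet(currentLog));
        queuedLogs.removeAll(removed);
        return removed;
    }

    synchronized int queueSize() {
        return queuedLogs.size();
    }
}
```

With the flag set, cleanOldLogs leaves the queue untouched, so the logs stay
available for replay once the peer problem is resolved.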
An additional idea would be to add some retry logic inside the PeersWatcher
class for the nodeChildrenChanged method. Thus if there happens to be some
issue we could sort it out without having to bounce that particular
regionserver.
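A rough sketch of what such retry logic might look like follows. It is only
an illustration under assumptions: PeerAddRetrySketch, PeerAdder, and
addPeerWithRetry are hypothetical names, and how this would be wired into
PeersWatcher.nodeChildrenChanged is left open.

```java
import java.net.UnknownHostException;

// Hypothetical helper; not part of HBase.
class PeerAddRetrySketch {
    // Hypothetical callback standing in for whatever actually adds the peer.
    interface PeerAdder {
        void addPeer(String peerId) throws UnknownHostException;
    }

    // Tries to add the peer up to maxAttempts times, doubling the wait
    // between attempts. Returns true on success, false if every attempt
    // failed or the thread was interrupted while waiting.
    static boolean addPeerWithRetry(PeerAdder adder, String peerId,
                                    int maxAttempts, long initialBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                adder.addPeer(peerId);
                return true;
            } catch (UnknownHostException e) {
                System.err.println("Attempt " + attempt + " to add peer "
                    + peerId + " failed: " + e.getMessage());
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt and give up.
                        Thread.currentThread().interrupt();
                        return false;
                    }
                    backoffMs *= 2;
                }
            }
        }
        return false;
    }
}
```

A bounded number of attempts keeps a genuine typo in "add_peer" from retrying
forever, while a transient DNS failure like the one above would likely clear
within a few attempts.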
Would love to hear everyone's thoughts.