[ https://issues.apache.org/jira/browse/HBASE-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602012#comment-13602012 ]
Lars Hofhansl commented on HBASE-8099:
--------------------------------------
That works. Personally I'd probably just return queues in the first case and do
a clear() for the second like this:
{code}
- if (peerIdsToProcess == null) return null; // node already processed
+ if (peerIdsToProcess == null) return queues; // node already processed
...
LOG.warn("Got exception in copyQueuesFromRSUsingMulti: ", e);
+ queues.clear();
{code}
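To illustrate the pattern in the diff above, here is a minimal, self-contained sketch. It is not the actual ReplicationZookeeper code: the class QueueCopySketch, the failMidway flag and the fake log names are made up for the example, and the map type is only an assumption about what the real method returns. The point is that callers always get a non-null map, and that a failed copy hands back nothing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the queue-copy method; NOT the real HBase code.
class QueueCopySketch {

  /**
   * @param peerIds    peers whose queues should be claimed, or null if the
   *                   node was already processed by another region server
   * @param failMidway simulates the multi operation throwing partway through
   * @return the claimed queues; empty (never null) if nothing was claimed
   */
  static SortedMap<String, SortedSet<String>> copyQueues(List<String> peerIds,
                                                         boolean failMidway) {
    SortedMap<String, SortedSet<String>> queues =
        new TreeMap<String, SortedSet<String>>();
    if (peerIds == null) {
      return queues; // node already processed: return the empty map, not null
    }
    try {
      for (String peerId : peerIds) {
        SortedSet<String> logs = new TreeSet<String>();
        logs.add("hlog-for-" + peerId); // made-up log name
        queues.put(peerId, logs);
        if (failMidway) {
          throw new IllegalStateException("simulated multi failure");
        }
      }
    } catch (IllegalStateException e) {
      // The copy did not go through, so discard the partially built result.
      queues.clear();
    }
    return queues;
  }

  public static void main(String[] args) {
    List<String> peers = Arrays.asList("1", "2");
    System.out.println(copyQueues(null, false).isEmpty());
    System.out.println(copyQueues(peers, true).isEmpty());
    System.out.println(copyQueues(peers, false).keySet());
  }
}
```

With this shape the caller only needs to iterate over the result; there is no separate null check, and a failed attempt looks the same as "nothing to do".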
Maybe while we're at it, we could add a random jitter to the failover.
Add a Random member to ReplicationSourceManager and then do this in
NodeFailoverWorker:
{code}
- Thread.sleep(sleepBeforeFailover);
+ Thread.sleep(sleepBeforeFailover +
(long)(random.nextFloat()*sleepBeforeFailover));
{code}
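As a rough sketch of what the jitter does, the arithmetic from the diff can be pulled into a small standalone helper. The class FailoverJitterSketch and the method jitteredSleep are hypothetical names used only for this example; the expression itself is the one from the diff. For ordinary delay values the result is at least the base delay and less than twice the base delay, so region servers that notice the same failure stop waking up at the same instant.

```java
import java.util.Random;

// Hypothetical helper for illustration; not part of ReplicationSourceManager.
class FailoverJitterSketch {

  /**
   * Returns the base delay plus a random extra of up to (but not including)
   * one more base delay, using the same expression as the diff above.
   */
  static long jitteredSleep(long sleepBeforeFailover, Random random) {
    return sleepBeforeFailover
        + (long) (random.nextFloat() * sleepBeforeFailover);
  }

  public static void main(String[] args) {
    Random random = new Random();
    long base = 2000L; // example value in milliseconds, chosen arbitrarily
    for (int i = 0; i < 5; i++) {
      System.out.println(jitteredSleep(base, random));
    }
  }
}
```

In the worker, the returned value would simply be passed to Thread.sleep() in place of the fixed delay.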
> ReplicationZookeeper.copyQueuesFromRSUsingMulti should not return any queues
> if it failed to execute.
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-8099
> URL: https://issues.apache.org/jira/browse/HBASE-8099
> Project: HBase
> Issue Type: Bug
> Reporter: Lars Hofhansl
> Assignee: Himanshu Vashishtha
> Priority: Blocker
> Fix For: 0.94.7
>
> Attachments: HBase-8099-94.patch, HBase-8099-94-v2.patch,
> HBase-8099-trunk-2.patch, HBase-8099-trunk.patch
>
>
> We just ran into an interesting scenario. We restarted a cluster that was
> set up as a replication source.
> The stop went cleanly.
> Upon restart *all* regionservers aborted within a few seconds with variations
> of these errors:
> http://pastebin.com/3iQVuBqS