[ https://issues.apache.org/jira/browse/HBASE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102204#comment-13102204 ]
Chris Trezzo commented on HBASE-3130:
-------------------------------------
Hi all,
I am the "somebody at salesforce" that is currently looking at this issue. I
should have a sketch patch for you guys shortly.
Here is the general gist of what I am doing:
1. I am now specifically catching the SESSIONEXPIRED KeeperExceptions that are
thrown by methods which use the ZooKeeperWatcher from a ReplicationPeer.
2. I am adding a public method to ReplicationZookeeper that is responsible for
retrying/opening a new connection to the peer ZooKeeper cluster. This method
will take a ReplicationPeer (the old one, for the peer we are trying to
reconnect to) and an Abortable (which will be the ReplicationSource). It will
return a new ReplicationPeer with a fresh connection.
3. I will modify ReplicationSource so that it implements the Abortable
interface.
4. Currently I am going back and forth between two retry strategies:
A. There is a sleepMultiplier and a maxNumberOfRetries. The source will try to
reconnect until maxNumberOfRetries is exceeded, after which it will abort.
B. Same as A, except that after exceeding maxNumberOfRetries it will continue
retrying indefinitely, but at a very low frequency (e.g., every 5 min). With
this approach I would implement it in such a way that replication could always
be turned off, at which point the retries would stop.
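To make the difference between A and B concrete, here is a rough, standalone
sketch of the retry loop from item 4. It is not HBase code: the class
ReconnectRetryPolicy, its fields, and the two BooleanSupplier callbacks
(standing in for "try to reconnect to the peer" and "is replication still
enabled") are hypothetical names chosen for illustration, and the linear growth
of sleepMultiplier is an assumption.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of retry strategies A and B; not part of HBase. */
final class ReconnectRetryPolicy {
    private final long baseSleepMs;          // sleep unit scaled by sleepMultiplier
    private final int maxNumberOfRetries;    // fast retries before A aborts / B slows down
    private final long slowRetryIntervalMs;  // strategy B only, e.g. 5 minutes
    private final boolean retryForever;      // false = strategy A, true = strategy B

    ReconnectRetryPolicy(long baseSleepMs, int maxNumberOfRetries,
                         long slowRetryIntervalMs, boolean retryForever) {
        this.baseSleepMs = baseSleepMs;
        this.maxNumberOfRetries = maxNumberOfRetries;
        this.slowRetryIntervalMs = slowRetryIntervalMs;
        this.retryForever = retryForever;
    }

    /** Delay before the given 1-based attempt, or -1 if the source should abort. */
    long delayBeforeAttempt(int attempt) {
        if (attempt <= maxNumberOfRetries) {
            long sleepMultiplier = attempt;  // assumed: grows linearly with each attempt
            return baseSleepMs * sleepMultiplier;
        }
        return retryForever ? slowRetryIntervalMs : -1L;
    }

    /**
     * Runs the retry loop. Returns true once tryReconnect succeeds; returns false
     * if retries are exhausted (A), replication is turned off, or the thread is
     * interrupted. The caller decides whether false means "abort".
     */
    boolean run(BooleanSupplier tryReconnect, BooleanSupplier replicationEnabled) {
        for (int attempt = 1; replicationEnabled.getAsBoolean(); attempt++) {
            long delay = delayBeforeAttempt(attempt);
            if (delay < 0) {
                return false;                // strategy A: give up after maxNumberOfRetries
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt for the caller
                return false;
            }
            if (tryReconnect.getAsBoolean()) {
                return true;
            }
        }
        return false;                        // replication was turned off
    }
}
```

With retryForever set to false this behaves like A; with it set to true, the
loop only ends on success, on interruption, or when replicationEnabled starts
returning false, which is the "retries stop when replication is turned off"
behaviour described for B.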
Thoughts/comments/suggestions are always appreciated. I am excited to
contribute!
Thanks,
Chris Trezzo
> [replication] ReplicationSource can't recover from session expired on remote
> clusters
> -------------------------------------------------------------------------------------
>
> Key: HBASE-3130
> URL: https://issues.apache.org/jira/browse/HBASE-3130
> Project: HBase
> Issue Type: Bug
> Components: replication
> Reporter: Jean-Daniel Cryans
>
> Currently ReplicationSource cannot recover when its ZooKeeper connection to
> its remote cluster expires. HLogs are still being tracked, but a cluster
> restart (or a rolling restart) is required to continue replication.