[ 
https://issues.apache.org/jira/browse/HBASE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102204#comment-13102204
 ] 

Chris Trezzo commented on HBASE-3130:
-------------------------------------

Hi all,

I am the "somebody at salesforce" that is currently looking at this issue. I 
should have a sketch patch for you guys shortly.

Here is the general gist of what I am doing:
1. I am now specifically catching the SESSIONEXPIRED KeeperExceptions that are 
thrown by methods which use a ZooKeeperWatcher from ReplicationPeer.

2. I am adding a public method to ReplicationZookeeper that is responsible for 
retrying/opening a new connection to the peer zookeeper cluster. This method 
will take a ReplicationPeer (the old rp from the peer we are trying to 
reconnect to) and an Abortable (which will be the ReplicationSource). It will 
return a new ReplicationPeer with a fresh connection.

3. I will modify ReplicationSource so that it implements the Abortable 
interface.

4. Currently I am going back and forth between two retry strategies:
A. There is a sleepMultiplier and a maxNumberOfRetries. The source will try to 
reconnect until the maxNumberOfRetries is exceeded, after which it will abort.

B. Same as A, except after exceeding maxNumberOfRetries it will continue 
retrying indefinitely, but at a very low frequency (e.g. every 5 min). With this 
approach I would implement it in a way such that replication could always be 
turned off, at which point the retries would stop.
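
To make the difference between A and B concrete, here is a minimal sketch of 
the two schedules. It is not the patch and not HBase code: ReconnectBackoff, 
Strategy, STOP, baseSleepMs and slowRetryIntervalMs are hypothetical names, 
and it assumes a linear backoff where sleepMultiplier is simply the attempt 
number.

```java
/** Hypothetical helper for illustration only; not part of HBase. */
final class ReconnectBackoff {
    /** A = abort once maxNumberOfRetries is exceeded; B = keep retrying slowly. */
    enum Strategy { ABORT_AFTER_MAX, SLOW_RETRY_AFTER_MAX }

    /** Sentinel: the caller should stop retrying (abort under A, give up under B). */
    static final long STOP = -1L;

    private final Strategy strategy;
    private final long baseSleepMs;          // assumed base sleep between attempts
    private final int maxNumberOfRetries;
    private final long slowRetryIntervalMs;  // only used by strategy B

    ReconnectBackoff(Strategy strategy, long baseSleepMs,
                     int maxNumberOfRetries, long slowRetryIntervalMs) {
        this.strategy = strategy;
        this.baseSleepMs = baseSleepMs;
        this.maxNumberOfRetries = maxNumberOfRetries;
        this.slowRetryIntervalMs = slowRetryIntervalMs;
    }

    /**
     * Returns how long to sleep before the given 1-based reconnect attempt,
     * or STOP if no further attempt should be made.
     */
    long sleepBeforeAttempt(int attempt, boolean replicationEnabled) {
        if (attempt <= maxNumberOfRetries) {
            // Assumption: sleepMultiplier grows by one per attempt.
            long sleepMultiplier = attempt;
            return baseSleepMs * sleepMultiplier;
        }
        if (strategy == Strategy.ABORT_AFTER_MAX) {
            return STOP; // strategy A: the source aborts here
        }
        // Strategy B: low-frequency retries until replication is turned off.
        return replicationEnabled ? slowRetryIntervalMs : STOP;
    }
}
```

With a 1 s base sleep, 3 retries and a 5 min slow interval, both strategies 
sleep 1 s, 2 s and 3 s for the first three attempts. On the fourth attempt A 
returns STOP, while B returns the 5 min interval for as long as replication 
stays enabled.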

Thoughts/comments/suggestions are always appreciated. I am excited to 
contribute!

Thanks,
Chris Trezzo

> [replication] ReplicationSource can't recover from session expired on remote 
> clusters
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-3130
>                 URL: https://issues.apache.org/jira/browse/HBASE-3130
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>            Reporter: Jean-Daniel Cryans
>
> Currently ReplicationSource cannot recover when its zookeeper connection to 
> its remote cluster expires. HLogs are still being tracked, but a cluster 
> restart is required to continue replication (or a rolling restart).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
