[ 
https://issues.apache.org/jira/browse/HBASE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105674#comment-13105674
 ] 

Chris Trezzo commented on HBASE-3130:
-------------------------------------

@J-D

Thanks for the comments. I am attaching a first patch that I worked on with 
Lars. There are a few things that are slightly different from your comments 
above, but I figured I would post it anyway just to get the ball rolling. Let 
me know what you think.

Here is an overview of the changes:

1. Added a new reloadZkWatcher method to ReplicationPeer. This method is 
responsible for resetting the ReplicationPeer's ZookeeperWatcher when the 
connection for the old watcher dies.

2. The crux of the change is in ReplicationSource.chooseSinks. We catch the 
KeeperException there, and if it is a ConnectionLoss or SessionExpired we reset 
that peer's ZookeeperWatcher using the new method from (1). Retries are handled 
by the existing loops in the callers of chooseSinks (i.e. connectToPeers and 
shipEdits).

3. ReplicationPeer now implements Abortable and is passed in as the Abortable 
for the peer's ZookeeperWatcher. The abort method simply logs the fact that it 
was called and returns.
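
For illustration, the flow in (1) through (3) can be sketched roughly as 
follows. This is not the attached patch: KeeperException, Watcher and Peer 
below are simplified stand-ins written for this example (they are not the 
ZooKeeper or HBase classes), and the method bodies are assumptions about the 
shape of the change rather than the actual code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PeerReconnectSketch {

    /** Stand-in for ZooKeeper's KeeperException (hypothetical, simplified). */
    static class KeeperException extends Exception {
        enum Code { CONNECTIONLOSS, SESSIONEXPIRED, NONODE }
        private final Code code;
        KeeperException(Code code) { super(code.name()); this.code = code; }
        Code code() { return code; }
    }

    /** Stand-in for the peer's watcher; an unhealthy one always fails. */
    static class Watcher {
        private final boolean healthy;
        Watcher(boolean healthy) { this.healthy = healthy; }
        List<String> listRegionServers() throws KeeperException {
            if (!healthy) {
                throw new KeeperException(KeeperException.Code.SESSIONEXPIRED);
            }
            return Arrays.asList("rs1", "rs2");
        }
    }

    /** Stand-in for ReplicationPeer: owns the watcher and can replace it. */
    static class Peer {
        private Watcher watcher;
        int reloads = 0;
        Peer(Watcher initial) { this.watcher = initial; }
        Watcher getWatcher() { return watcher; }

        // (1) Swap in a fresh watcher once the old connection is dead.
        void reloadZkWatcher() {
            watcher = new Watcher(true);
            reloads++;
        }

        // (3) Abort only records that it was called, then returns.
        void abort(String why, Throwable cause) {
            System.err.println("abort called: " + why);
        }
    }

    // (2) Catch the exception; reset the watcher only for connection loss
    // or session expiry, and return no sinks so that the caller retries.
    static List<String> chooseSinks(Peer peer) {
        try {
            return peer.getWatcher().listRegionServers();
        } catch (KeeperException e) {
            if (e.code() == KeeperException.Code.CONNECTIONLOSS
                    || e.code() == KeeperException.Code.SESSIONEXPIRED) {
                peer.reloadZkWatcher();
            }
            return Collections.emptyList();
        }
    }

    /** The retry loop lives in the caller, not in chooseSinks. */
    static List<String> connectToPeers(Peer peer, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            List<String> sinks = chooseSinks(peer);
            if (!sinks.isEmpty()) {
                return sinks;
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        Peer peer = new Peer(new Watcher(false)); // start with a dead session
        System.out.println(connectToPeers(peer, 3));
        System.out.println("watcher reloads: " + peer.reloads);
    }
}
```

The point of the sketch is that chooseSinks itself never loops: it resets the 
watcher and reports no sinks, and the caller's existing retry loop picks up 
the fresh watcher on its next pass.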

I ran the TestReplication, TestMasterReplication and TestMultiSlaveReplication 
tests with no failures. I also manually tested the case where the slave cluster 
fails, and confirmed that the ReplicationSource does recover and resumes 
replication once the peer cluster comes back.

Thanks,
Chris

> [replication] ReplicationSource can't recover from session expired on remote 
> clusters
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-3130
>                 URL: https://issues.apache.org/jira/browse/HBASE-3130
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>            Reporter: Jean-Daniel Cryans
>
> Currently ReplicationSource cannot recover when its zookeeper connection to 
> its remote cluster expires. HLogs are still being tracked, but a cluster 
> restart is required to continue replication (or a rolling restart).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
