[ https://issues.apache.org/jira/browse/HBASE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105674#comment-13105674 ]
Chris Trezzo commented on HBASE-3130:
-------------------------------------
@J-D
Thanks for the comments. I am attaching a first patch that I worked on with
Lars. There are a few things that are slightly different from your comments
above, but I figured I would post it anyway just to get the ball rolling. Let
me know what you think.
Here is an overview of the changes:
1. Added a new reloadZkWatcher method to ReplicationPeer. This method is
responsible for resetting the ReplicationPeer's ZookeeperWatcher when the
connection for the old watcher dies.
2. The crux of the change is in ReplicationSource.chooseSinks. We catch the
KeeperException there, and if it is a ConnectionLoss or SessionExpired we reset
that peer's ZookeeperWatcher using the new method from (1). Retries are handled
by the existing loops in the callers of chooseSinks (i.e. connectToPeers and
shipEdits).
3. ReplicationPeer now implements Abortable and is passed in as the Abortable
for the peer's ZookeeperWatcher. The abort method simply logs the fact that it
was called and returns.
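
To make the flow in (1)-(3) concrete, here is a rough, self-contained sketch of
the pattern. It is not the patch itself: it does not use the HBase or ZooKeeper
classes, and PeerZkException, Code, Watcher, WatcherFactory, Peer and
failingFirst are hypothetical stand-ins invented for illustration. Only the
names reloadZkWatcher, chooseSinks, connectToPeers and abort come from the
description above, and their signatures here are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PeerRecoverySketch {

    /** Stand-in for the KeeperException cases mentioned above (assumption). */
    enum Code { CONNECTION_LOSS, SESSION_EXPIRED, NO_NODE }

    /** Hypothetical stand-in for ZooKeeper's KeeperException. */
    static class PeerZkException extends Exception {
        final Code code;
        PeerZkException(Code code) { super(code.name()); this.code = code; }
    }

    /** Hypothetical stand-in for the peer's watcher. */
    interface Watcher { List<String> listSinks() throws PeerZkException; }

    /** Hypothetical factory so the peer can build a fresh watcher. */
    interface WatcherFactory { Watcher create(); }

    /** Stand-in for ReplicationPeer. */
    static class Peer {
        private final WatcherFactory factory;
        private Watcher watcher;
        int reloads = 0;

        Peer(WatcherFactory factory) {
            this.factory = factory;
            this.watcher = factory.create();
        }

        // (1) Replace the dead watcher with a fresh one.
        void reloadZkWatcher() {
            watcher = factory.create();
            reloads++;
        }

        // (3) Abortable-style callback: log the call and return.
        void abort(String why, Throwable cause) {
            System.err.println("abort called on peer: " + why);
        }

        Watcher getWatcher() { return watcher; }
    }

    // (2) Catch the exception; reset the watcher only for the two
    // connection-related codes. Returns an empty list on any failure.
    static List<String> chooseSinks(Peer peer) {
        try {
            return peer.getWatcher().listSinks();
        } catch (PeerZkException e) {
            if (e.code == Code.CONNECTION_LOSS || e.code == Code.SESSION_EXPIRED) {
                peer.reloadZkWatcher();
            }
            return new ArrayList<>();
        }
    }

    // The caller's loop provides the retry; chooseSinks does not loop itself.
    static List<String> connectToPeers(Peer peer, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            List<String> sinks = chooseSinks(peer);
            if (!sinks.isEmpty()) {
                return sinks;
            }
        }
        return new ArrayList<>();
    }

    /** Test helper: the first `failures` watchers created throw `code`. */
    static WatcherFactory failingFirst(int failures, Code code) {
        int[] created = {0};
        return () -> {
            int n = created[0]++;
            return () -> {
                if (n < failures) {
                    throw new PeerZkException(code);
                }
                return Arrays.asList("sink-1");
            };
        };
    }

    public static void main(String[] args) {
        Peer peer = new Peer(failingFirst(1, Code.SESSION_EXPIRED));
        List<String> sinks = connectToPeers(peer, 3);
        System.out.println(sinks + " after " + peer.reloads + " reload(s)");
    }
}
```

The point of the split is that chooseSinks only repairs the watcher and reports
"no sinks", while the retry stays in the caller's existing loop, so no new retry
logic is needed.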
I ran TestReplication, TestMasterReplication and TestMultiSlaveReplication
tests with no failures. I also manually tested the case where the slave cluster
fails, and confirmed that the ReplicationSource does recover and resumes
replication once the peer cluster comes back.
Thanks,
Chris
> [replication] ReplicationSource can't recover from session expired on remote
> clusters
> -------------------------------------------------------------------------------------
>
> Key: HBASE-3130
> URL: https://issues.apache.org/jira/browse/HBASE-3130
> Project: HBase
> Issue Type: Bug
> Components: replication
> Reporter: Jean-Daniel Cryans
>
> Currently ReplicationSource cannot recover when its zookeeper connection to
> its remote cluster expires. HLogs are still being tracked, but a cluster
> restart is required to continue replication (or a rolling restart).