[ 
https://issues.apache.org/jira/browse/HBASE-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13116016#comment-13116016
 ] 

Jean-Daniel Cryans commented on HBASE-3130:
-------------------------------------------

I'd also add that I was able to test the patch (on 0.90) and it works. 
Proof:

First it loses the connection:
{quote}
2011-09-27 16:44:54,984 WARN 
org.apache.hadoop.hbase.replication.ReplicationPeer: connection to cluster: 
10.10.30.7:2181:/hbase1-0x132ad0f29d70017 connection to cluster: 
10.10.30.7:2181:/hbase1-0x132ad0f29d70017 received expired from ZooKeeper, 
aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343)
        at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261)
        at 
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
2011-09-27 16:44:54,984 INFO org.apache.zookeeper.ClientCnxn: EventThread shut 
down
{quote}

Later, when it tries to replicate, it talks to ZK again, hits the expired 
session, and works after reloading the connection:

{quote}
2011-09-27 16:49:03,738 DEBUG 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since we 
are unable to replicate, sleeping 1000 times 10
2011-09-27 16:49:13,738 WARN 
org.apache.hadoop.hbase.replication.ReplicationZookeeper: Lost the ZooKeeper 
connection for peer 1
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /hbase1/rs
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1243)
        at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:389)
        at 
org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndGetAsAddresses(ZKUtil.java:355)
        at 
org.apache.hadoop.hbase.replication.ReplicationZookeeper.fetchSlavesAddresses(ReplicationZookeeper.java:268)
        at 
org.apache.hadoop.hbase.replication.ReplicationZookeeper.getSlavesAddresses(ReplicationZookeeper.java:239)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.chooseSinks(ReplicationSource.java:205)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:588)
        at 
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:341)
2011-09-27 16:49:13,772 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=10.10.30.7:2181 sessionTimeout=20000 
watcher=connection to cluster: 10.10.30.7:2181:/hbase1
2011-09-27 16:49:13,773 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server /10.10.30.7:2181
2011-09-27 16:49:14,111 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to hbasedev.sfo.stumble.net/10.10.30.7:2181, initiating session
2011-09-27 16:49:14,140 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server hbasedev.sfo.stumble.net/10.10.30.7:2181, 
sessionid = 0x132ad0f29d70024, negotiated timeout = 20000
{quote}
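
For anyone following along, the recovery pattern visible in the second log can 
be sketched roughly as below. This is only an illustration and not the code in 
the attached patch: SessionExpiredException, PeerSession, ReloadingPeer and 
ReloadingPeerDemo are hypothetical stand-ins so that the example runs without 
ZooKeeper or HBase on the classpath, and returning an empty list after the 
reload is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for ZooKeeper's session-expired error. */
class SessionExpiredException extends Exception {
    SessionExpiredException(String path) {
        super("Session expired for " + path);
    }
}

/** Hypothetical minimal view of a session to the peer cluster's ZooKeeper. */
interface PeerSession {
    List<String> listChildren(String path) throws SessionExpiredException;
}

/** Holds the session to a peer cluster and replaces it once it has expired. */
class ReloadingPeer {
    private final Supplier<PeerSession> sessionFactory;
    private PeerSession session;
    private int reloads = 0;

    ReloadingPeer(Supplier<PeerSession> sessionFactory) {
        this.sessionFactory = sessionFactory;
        this.session = sessionFactory.get();
    }

    /**
     * Lists the peer's region servers. If the session has expired, a fresh
     * session is opened and an empty list is returned, so the caller simply
     * finds no sinks this time and succeeds on its next attempt.
     */
    List<String> getSlavesAddresses(String rsPath) {
        try {
            return session.listChildren(rsPath);
        } catch (SessionExpiredException e) {
            // An expired session never comes back, so build a new one
            // instead of retrying on the old handle.
            session = sessionFactory.get();
            reloads++;
            return new ArrayList<>();
        }
    }

    int reloadCount() {
        return reloads;
    }
}

class ReloadingPeerDemo {
    public static void main(String[] args) {
        int[] opened = {0};
        // The first session opened always reports expiry; later ones work.
        ReloadingPeer peer = new ReloadingPeer(() -> {
            int id = ++opened[0];
            return path -> {
                if (id == 1) {
                    throw new SessionExpiredException(path);
                }
                return Arrays.asList("rs1.example.org:60020");
            };
        });
        System.out.println(peer.getSlavesAddresses("/hbase1/rs")); // expired, reloads
        System.out.println(peer.getSlavesAddresses("/hbase1/rs")); // uses new session
    }
}
```

In the demo the first call lands on the expired session and triggers the 
reload, and the second call goes through the new session. That mirrors the 
sequence in the log: the "Lost the ZooKeeper connection for peer 1" warning, 
followed by a new client connection and a new sessionid.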
                
> [replication] ReplicationSource can't recover from session expired on remote 
> clusters
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-3130
>                 URL: https://issues.apache.org/jira/browse/HBASE-3130
>             Project: HBase
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 0.92.0
>            Reporter: Jean-Daniel Cryans
>            Assignee: Chris Trezzo
>             Fix For: 0.92.0
>
>         Attachments: 3130-v2.txt, 3130-v3.txt, 3130.txt
>
>
> Currently ReplicationSource cannot recover when its zookeeper connection to 
> its remote cluster expires. HLogs are still being tracked, but a cluster 
> restart is required to continue replication (or a rolling restart).
