[ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nitay Joffe updated HBASE-1232:
-------------------------------
Attachment: hbase-1232.patch
The idea in this patch is to have the client HConnection, that is, TableServers,
watch for the SessionExpired event and do the right thing. After looking over
the code, I think the right thing to do is to clear out the ZooKeeperWrapper
being used (that handle is dead anyway) so that the next call to
getZooKeeperWrapper() instantiates a new handle.
Please check in particular whether this introduces any concurrency issues. I
think it's safe, but it'd be nice to get some validation.
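A minimal sketch of the clear-and-recreate idea, assuming one lock guards both the getter and the watcher callback. `LazyHandleConnection`, `Handle`, `SessionState` and `process(Handle, SessionState)` are hypothetical names, not the HBase or ZooKeeper API. The identity check on the handle that raised the event is an extra guard this sketch adds (so a late event from an old handle cannot clear a newer one); the patch is not known to do this.

```java
/**
 * Hypothetical sketch: drop the dead handle on session expiry and recreate
 * it lazily. Handle and SessionState are stand-ins for ZooKeeperWrapper and
 * ZooKeeper's event states, not the real classes.
 */
class LazyHandleConnection {
  enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

  /** Stand-in for a ZooKeeperWrapper instance. */
  static final class Handle {
    final int generation;
    Handle(int generation) { this.generation = generation; }
  }

  // Both fields are guarded by "this"; every method below locks it.
  private Handle handle;
  private int created;

  /** Returns the current handle, creating a new one if none is live. */
  synchronized Handle getHandle() {
    if (handle == null) {
      created++;
      handle = new Handle(created);
    }
    return handle;
  }

  /**
   * Watcher callback. Only clears the handle if the event came from the
   * handle currently in use, so a stale event cannot discard a fresh one.
   */
  synchronized void process(Handle source, SessionState state) {
    if (state == SessionState.EXPIRED && source == handle) {
      handle = null; // the next getHandle() call builds a fresh handle
    }
    // DISCONNECTED is ignored here: ZooKeeper's client reconnects on its
    // own while the session is still valid.
  }

  synchronized int handlesCreated() { return created; }
}
```

Threads that already hold a reference to the old handle will still see failures from it; the assumption is that their retry goes back through the getter and picks up the new handle.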
In detail:
- Add getZooKeeperWrapper() to HConnection.
- Make TableServers implement Watcher.
- Add getSessionID() and getSessionPassword() to ZooKeeperWrapper so that
SessionExpired can be tested.
- Add getQuorumPeers() to MiniZooKeeperCluster to get the ZooKeeper quorum in tests.
- Add a test that causes the client's ZooKeeper session to expire.
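The session ID and password accessors presumably support a common way to expire a session on purpose: open a second client with the same session ID and password, then close it, so the original client's next request fails with SessionExpired. The sketch below models that in memory rather than starting a real quorum. `FakeQuorum`, `FakeClient` and `expireSession` are hypothetical; the accessor names only mirror the ones listed above. A real test would instead pass the ID and password to ZooKeeper's client constructor that accepts an existing session (assumed to be available in the ZooKeeper version in use).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/**
 * In-memory model of forcing a session to expire in a test: a second client
 * attaches with the victim's session ID and password, then closes the
 * session. FakeQuorum and FakeClient are hypothetical stand-ins.
 */
class ExpireSessionSketch {
  static final class SessionExpiredException extends Exception {
    SessionExpiredException(long id) {
      super("Session expired: 0x" + Long.toHexString(id));
    }
  }

  /** Server side: tracks live sessions and their passwords. */
  static final class FakeQuorum {
    private final Map<Long, byte[]> live = new HashMap<>();
    private final Random random = new Random(42);
    private long nextId = 1;

    synchronized FakeClient connect() {
      long id = nextId++;
      byte[] password = new byte[16];
      random.nextBytes(password);
      live.put(id, password);
      return new FakeClient(this, id, password);
    }

    /** Attach to an existing session if the ID and password match. */
    synchronized FakeClient connect(long id, byte[] password)
        throws SessionExpiredException {
      byte[] expected = live.get(id);
      if (expected == null || !Arrays.equals(expected, password)) {
        throw new SessionExpiredException(id);
      }
      return new FakeClient(this, id, password);
    }

    synchronized void closeSession(long id) { live.remove(id); }
    synchronized boolean isLive(long id) { return live.containsKey(id); }
  }

  static final class FakeClient {
    private final FakeQuorum quorum;
    private final long sessionId;
    private final byte[] password;

    FakeClient(FakeQuorum quorum, long sessionId, byte[] password) {
      this.quorum = quorum;
      this.sessionId = sessionId;
      this.password = password.clone();
    }

    long getSessionID() { return sessionId; }
    byte[] getSessionPassword() { return password.clone(); }

    /** Any operation on a closed session fails with SessionExpired. */
    void create(String znode) throws SessionExpiredException {
      if (!quorum.isLive(sessionId)) throw new SessionExpiredException(sessionId);
    }

    void close() { quorum.closeSession(sessionId); }
  }

  /** Test helper: hijack the victim's session and close it. */
  static void expireSession(FakeQuorum quorum, FakeClient victim)
      throws SessionExpiredException {
    FakeClient hijacker =
        quorum.connect(victim.getSessionID(), victim.getSessionPassword());
    hijacker.close();
  }
}
```

In a real cluster the original client learns of the expiry when it next talks to the server (its watcher receives the Expired state); the model collapses that into the next `create()` call failing.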
> zookeeper client wont reconnect if there is a problem
> -----------------------------------------------------
>
> Key: HBASE-1232
> URL: https://issues.apache.org/jira/browse/HBASE-1232
> Project: Hadoop HBase
> Issue Type: Bug
> Environment: java 1.7, zookeeper 3.0.1
> Reporter: ryan rawson
> Assignee: Nitay Joffe
> Priority: Critical
> Fix For: 0.20.0
>
> Attachments: hbase-1232.patch
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>     at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>     at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> This message repeats over and over.
> Looking at the code in question:
> private boolean ensureExists(final String znode) {
>   try {
>     zooKeeper.create(znode, new byte[0],
>         Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>     LOG.debug("Created ZNode " + znode);
>     return true;
>   } catch (KeeperException.NodeExistsException e) {
>     return true; // ok, move on.
>   } catch (KeeperException.NoNodeException e) {
>     return ensureParentExists(znode) && ensureExists(znode);
>   } catch (KeeperException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   } catch (InterruptedException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   }
>   return false;
> }
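> We need to catch SessionExpiredException specifically and reopen the ZK
> connection. As written, it falls into the generic KeeperException branch,
> which only logs and returns false, so every retry hits the same dead session.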
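As a sketch of what catching the expiry specifically could look like, here is a version of ensureExists() that, on SessionExpiredException, replaces the dead client with a new one and retries the create once. It assumes a single caller thread and uses hypothetical stand-in types (`Client`, `ClientFactory`, `FakeServer` and the exception classes) in place of the real ZooKeeper classes, so it runs without a quorum.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.logging.Logger;

/**
 * Hypothetical rework of ensureExists() that treats session expiry as
 * "reopen and retry once". The exception and client types are minimal
 * stand-ins, not org.apache.zookeeper classes.
 */
class EnsureExistsSketch {
  private static final Logger LOG =
      Logger.getLogger(EnsureExistsSketch.class.getName());

  static class KeeperException extends Exception {
    KeeperException(String znode) { super(znode); }
  }
  static final class NodeExistsException extends KeeperException {
    NodeExistsException(String znode) { super(znode); }
  }
  static final class NoNodeException extends KeeperException {
    NoNodeException(String znode) { super(znode); }
  }
  static final class SessionExpiredException extends KeeperException {
    SessionExpiredException(String znode) { super(znode); }
  }

  /** Creates an empty persistent znode. */
  interface Client {
    void create(String znode) throws KeeperException, InterruptedException;
  }
  interface ClientFactory { Client newClient(); }

  private final ClientFactory factory;
  private Client zooKeeper;

  EnsureExistsSketch(ClientFactory factory) {
    this.factory = factory;
    this.zooKeeper = factory.newClient();
  }

  boolean ensureExists(String znode) { return ensureExists(znode, true); }

  private boolean ensureExists(String znode, boolean mayReconnect) {
    try {
      zooKeeper.create(znode);
      return true;
    } catch (NodeExistsException e) {
      return true; // ok, move on.
    } catch (NoNodeException e) {
      return ensureParentExists(znode) && ensureExists(znode, mayReconnect);
    } catch (SessionExpiredException e) {
      if (!mayReconnect) {
        LOG.warning("Session expired again creating " + znode);
        return false;
      }
      // The old handle is dead for good: open a new session and retry once.
      zooKeeper = factory.newClient();
      return ensureExists(znode, false);
    } catch (KeeperException e) {
      LOG.warning("Failed to create " + znode + ": " + e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      LOG.warning("Interrupted creating " + znode);
    }
    return false;
  }

  private boolean ensureParentExists(String znode) {
    int slash = znode.lastIndexOf('/');
    return slash <= 0 || ensureExists(znode.substring(0, slash));
  }

  /** In-memory server whose sessions can be expired on demand. */
  static final class FakeServer implements ClientFactory {
    final Set<String> znodes = new HashSet<>();
    int clientsCreated;
    private int liveGeneration;

    FakeServer() { znodes.add("/"); }

    synchronized void expireSessions() { liveGeneration++; }

    @Override
    public synchronized Client newClient() {
      clientsCreated++;
      final int generation = liveGeneration;
      return znode -> {
        synchronized (FakeServer.this) {
          if (generation != liveGeneration) throw new SessionExpiredException(znode);
          int slash = znode.lastIndexOf('/');
          String parent = slash == 0 ? "/" : znode.substring(0, slash);
          if (!znodes.contains(parent)) throw new NoNodeException(znode);
          if (!znodes.add(znode)) throw new NodeExistsException(znode);
        }
      };
    }
  }
}
```

Retrying only once keeps the method from looping forever if fresh sessions keep expiring; restoring the interrupt flag is a small extra change over the original catch block.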
--
This message is automatically generated by JIRA.