[ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated HBASE-1232:
-------------------------------

    Attachment: hbase-1232.patch

The idea in this patch is to have the client HConnection (that is, TableServers) 
watch for the SessionExpired event and react to it. After looking over the code 
a bit, I think the right reaction is to clear out the ZooKeeperWrapper being 
used (that handle is dead at that point anyway) so that the next call to 
getZooKeeperWrapper() instantiates a new handle.
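
Roughly, that looks like the following sketch. This is not the patch verbatim: 
the field name, the locking, and the ZooKeeperWrapper constructor form are 
assumptions made for illustration.

  // Sketch only: handle session expiry by dropping the dead wrapper.
  public void process(WatchedEvent event) {
    if (event.getState() == Watcher.Event.KeeperState.Expired) {
      synchronized (this) {
        // The old handle is dead; drop it so the next caller builds a new one.
        zooKeeperWrapper = null;
      }
    }
  }

  public synchronized ZooKeeperWrapper getZooKeeperWrapper() throws IOException {
    if (zooKeeperWrapper == null) {
      // Register this connection as the Watcher so it also sees the next expiry.
      zooKeeperWrapper = new ZooKeeperWrapper(conf, this);  // constructor form assumed
    }
    return zooKeeperWrapper;
  }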

Please check in particular whether this introduces any concurrency issues. I 
think it's safe, but it would be nice to get some validation.

In detail:
- Add getZooKeeperWrapper() to HConnection.
- TableServers now implements Watcher.
- Add getSessionID() and getSessionPassword() to ZooKeeperWrapper so tests can 
force a SessionExpired (see the sketch after this list).
- Add getQuorumPeers() to MiniZooKeeperCluster to get the ZooKeeper quorum in tests.
- Add a test that causes the client's ZooKeeper session to expire.
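
The test relies on the standard ZooKeeper trick for forcing an expiration: 
connect a second client with the original client's session id and password, 
then close it, which makes the server expire that session. A rough sketch 
follows; how the HConnection, wrapper, and quorum string are obtained is 
simplified and partly assumed.

  // Sketch of the test technique; variable setup is abbreviated.
  ZooKeeperWrapper zkw = connection.getZooKeeperWrapper();
  String quorum = zkCluster.getQuorumPeers();  // assumed to return "host:port,..."

  // Open a second handle on the same session, then close it. Closing the
  // duplicate tells the server to end the session, so the original client
  // receives a SessionExpired event on its own handle.
  ZooKeeper duplicate = new ZooKeeper(quorum, 60 * 1000,
      new Watcher() { public void process(WatchedEvent e) { /* no-op */ } },
      zkw.getSessionID(), zkw.getSessionPassword());
  duplicate.close();

  // The client should now obtain a fresh handle on its next call to
  // getZooKeeperWrapper() and continue working.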

> zookeeper client wont reconnect if there is a problem
> -----------------------------------------------------
>
>                 Key: HBASE-1232
>                 URL: https://issues.apache.org/jira/browse/HBASE-1232
>             Project: Hadoop HBase
>          Issue Type: Bug
>         Environment: java 1.7, zookeeper 3.0.1
>            Reporter: ryan rawson
>            Assignee: Nitay Joffe
>            Priority: Critical
>             Fix For: 0.20.0
>
>         Attachments: hbase-1232.patch
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode 
> = Session expired for /hbase
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>         at 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>         at 
> org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>         at 
> org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>         at 
> org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>         at 
> org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>         at 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>         at 
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> this message repeats over and over.  
> Looking at the code in question:
>   private boolean ensureExists(final String znode) {
>     try {
>       zooKeeper.create(znode, new byte[0],
>                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>       LOG.debug("Created ZNode " + znode);
>       return true;
>     } catch (KeeperException.NodeExistsException e) {
>       return true;      // ok, move on.
>     } catch (KeeperException.NoNodeException e) {
>       return ensureParentExists(znode) && ensureExists(znode);
>     } catch (KeeperException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     } catch (InterruptedException e) {
>       LOG.warn("Failed to create " + znode + ":", e);
>     }
>     return false;
>   }
> We need to catch this exception specifically and reopen the ZK connection.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
