[
https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578
]
Nitay Joffe edited comment on HBASE-1232 at 3/24/09 2:12 AM:
-------------------------------------------------------------
When a SessionExpired event occurs, we lose our ephemeral nodes. This means
everyone else in the cluster will think that node is down. To fix this we need
to restart the node completely.
For example, if the master's connection to ZooKeeper throws SessionExpired, it
loses its ephemeral address node in ZooKeeper and everyone will think the
master has died. In fact, now that we have the HA master lock, another master
may take over.
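One way to act on this (a minimal sketch under assumptions: the `SessionState` enum and `restartHook` wiring are hypothetical stand-ins, not existing HBase or ZooKeeper classes) is to watch for the Expired state in the session watcher and trigger a full restart exactly once:

```java
// Sketch: a one-shot hook that fires a full server restart when the
// ZooKeeper session is reported expired. By that point the ephemeral
// nodes are already gone, so only a fresh session (full restart) helps.
enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

class ExpiryWatcher {
  private final Runnable restartHook; // e.g. server::restart (assumption)
  private boolean restarted = false;

  ExpiryWatcher(Runnable restartHook) {
    this.restartHook = restartHook;
  }

  void process(SessionState state) {
    if (state == SessionState.EXPIRED && !restarted) {
      restarted = true;   // guard: fire the restart only once
      restartHook.run();
    }
  }
}
```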
I'm writing the #restart() methods for HMaster and HRegionServer. Effectively
it's just something like:
{code}
shutdown();
run();
{code}
I notice that the shutdown/stop methods in those classes just set a flag which
is later picked up and causes a shutdown. How do I make sure the server has
actually shut down between the shutdown() call and the run() call?
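One possible answer (a minimal sketch, assuming a hypothetical `stoppedLatch` field; the real HMaster/HRegionServer internals differ) is to have the main loop count down a latch when it exits, so restart() can block until the old loop is really gone before calling run() again:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: a server whose run() loop exits when shutdown() sets a flag,
// and whose restart() waits for the loop to actually finish first.
class RestartableServer {
  private volatile boolean stopRequested = false;
  private volatile CountDownLatch stoppedLatch = new CountDownLatch(1);

  void run() {
    stopRequested = false;
    stoppedLatch = new CountDownLatch(1);
    try {
      while (!stopRequested) {
        // ... main server loop would go here ...
        break; // placeholder so this sketch's loop terminates immediately
      }
    } finally {
      stoppedLatch.countDown(); // signal: the loop has really exited
    }
  }

  void shutdown() {
    stopRequested = true;
  }

  boolean restart(long timeoutMs) {
    shutdown();
    try {
      // Block until the loop has observed the flag and exited.
      if (!stoppedLatch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        return false; // old loop still running; don't start a second one
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    run();
    return true;
  }
}
```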
> zookeeper client won't reconnect if there is a problem
> ------------------------------------------------------
>
> Key: HBASE-1232
> URL: https://issues.apache.org/jira/browse/HBASE-1232
> Project: Hadoop HBase
> Issue Type: Bug
> Environment: java 1.7, zookeeper 3.0.1
> Reporter: ryan rawson
> Assignee: Nitay Joffe
> Priority: Critical
> Fix For: 0.20.0
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>         at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> this message repeats over and over.
> Looking at the code in question:
> {code}
> private boolean ensureExists(final String znode) {
>   try {
>     zooKeeper.create(znode, new byte[0],
>         Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>     LOG.debug("Created ZNode " + znode);
>     return true;
>   } catch (KeeperException.NodeExistsException e) {
>     return true; // ok, move on.
>   } catch (KeeperException.NoNodeException e) {
>     return ensureParentExists(znode) && ensureExists(znode);
>   } catch (KeeperException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   } catch (InterruptedException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   }
>   return false;
> }
> {code}
> We need to catch this exception specifically and reopen the ZK connection.
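A possible shape for that fix (a sketch with the ZooKeeper client stubbed out so the control flow stands alone; the nested exception classes and `reconnector` are stand-ins, not the real org.apache.zookeeper types) is to give SessionExpiredException its own catch branch that reopens the handle and retries, bounded so a second expiry doesn't loop forever:

```java
import java.util.function.Supplier;

// Sketch of the proposed fix: SessionExpired gets its own catch branch
// that reconnects and retries, instead of falling into the generic
// KeeperException branch that only logs and gives up.
class ZkWrapperSketch {
  // Hypothetical stand-ins for the real ZooKeeper exception types.
  static class KeeperException extends Exception {}
  static class SessionExpiredException extends KeeperException {}

  // Stand-in for the ZooKeeper client handle.
  interface ZkClient {
    void create(String znode) throws KeeperException;
  }

  private ZkClient zk;
  private final Supplier<ZkClient> reconnector; // builds a fresh handle

  ZkWrapperSketch(ZkClient zk, Supplier<ZkClient> reconnector) {
    this.zk = zk;
    this.reconnector = reconnector;
  }

  boolean ensureExists(String znode) {
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        zk.create(znode);
        return true;
      } catch (SessionExpiredException e) {
        // The session is gone for good: reopen the handle, then retry once.
        zk = reconnector.get();
      } catch (KeeperException e) {
        return false;
      }
    }
    return false; // the reconnected session expired too
  }
}
```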
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.