[
https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578
]
Nitay Joffe edited comment on HBASE-1232 at 3/24/09 2:12 AM:
-------------------------------------------------------------
When a SessionExpired event occurs, we lose our ephemeral nodes. This means
everyone else in the cluster will think that node is down. To fix this we need
to restart the node completely.
For example, if the master's connection to ZooKeeper throws SessionExpired, it
loses its ephemeral address node in ZooKeeper and everyone will think the
master has died. In fact, now that we have the HA master lock, another master
may take over.
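One way to act on this (a minimal sketch under assumptions: the `SessionState` enum and `restartHook` wiring are hypothetical stand-ins, not existing HBase or ZooKeeper classes) is to watch for the Expired state in the session watcher and trigger a full restart exactly once:

```java
// Sketch: a one-shot hook that fires a full server restart when the
// ZooKeeper session is reported expired. By that point the ephemeral
// nodes are already gone, so only a fresh session (full restart) helps.
enum SessionState { CONNECTED, DISCONNECTED, EXPIRED }

class ExpiryWatcher {
  private final Runnable restartHook; // e.g. server::restart (assumption)
  private boolean restarted = false;

  ExpiryWatcher(Runnable restartHook) {
    this.restartHook = restartHook;
  }

  void process(SessionState state) {
    if (state == SessionState.EXPIRED && !restarted) {
      restarted = true;   // guard: fire the restart only once
      restartHook.run();
    }
  }
}
```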
I'm writing the #restart() methods for HMaster and HRegionServer. Effectively
it's just something like:
{code}
shutdown();
run();
{code}
I notice that the shutdown/stop methods in those classes just set a flag which
is later picked up and causes a shutdown. How do I make sure the server has
actually shut down between the shutdown() call and the run() call?
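One possible answer (a minimal sketch, assuming a hypothetical `stoppedLatch` field; the real HMaster/HRegionServer internals differ) is to have the main loop count down a latch when it exits, so restart() can block until the old loop is really gone before calling run() again:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch: a server whose run() loop exits when shutdown() sets a flag,
// and whose restart() waits for the loop to actually finish first.
class RestartableServer {
  private volatile boolean stopRequested = false;
  private volatile CountDownLatch stoppedLatch = new CountDownLatch(1);

  void run() {
    stopRequested = false;
    stoppedLatch = new CountDownLatch(1);
    try {
      while (!stopRequested) {
        // ... main server loop would go here ...
        break; // placeholder so this sketch's loop terminates immediately
      }
    } finally {
      stoppedLatch.countDown(); // signal: the loop has really exited
    }
  }

  void shutdown() {
    stopRequested = true;
  }

  boolean restart(long timeoutMs) {
    shutdown();
    try {
      // Block until the loop has observed the flag and exited.
      if (!stoppedLatch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
        return false; // old loop still running; don't start a second one
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    run();
    return true;
  }
}
```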
> zookeeper client won't reconnect if there is a problem
> ------------------------------------------------------
>
> Key: HBASE-1232
> URL: https://issues.apache.org/jira/browse/HBASE-1232
> Project: Hadoop HBase
> Issue Type: Bug
> Environment: java 1.7, zookeeper 3.0.1
> Reporter: ryan rawson
> Assignee: Nitay Joffe
> Priority: Critical
> Fix For: 0.20.0
>
>
> my regionserver got wedged:
> 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
>         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
>         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
>         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
>         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
>         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
>         at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
>         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> this message repeats over and over.
> Looking at the code in question:
> {code}
> private boolean ensureExists(final String znode) {
>   try {
>     zooKeeper.create(znode, new byte[0],
>         Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>     LOG.debug("Created ZNode " + znode);
>     return true;
>   } catch (KeeperException.NodeExistsException e) {
>     return true; // ok, move on.
>   } catch (KeeperException.NoNodeException e) {
>     return ensureParentExists(znode) && ensureExists(znode);
>   } catch (KeeperException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   } catch (InterruptedException e) {
>     LOG.warn("Failed to create " + znode + ":", e);
>   }
>   return false;
> }
> {code}
> We need to catch this exception specifically and reopen the ZK connection.
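A possible shape for that fix (a sketch with the ZooKeeper client stubbed out so the control flow stands alone; the nested exception classes and `reconnector` are stand-ins, not the real org.apache.zookeeper types) is to give SessionExpiredException its own catch branch that reopens the handle and retries, bounded so a second expiry doesn't loop forever:

```java
import java.util.function.Supplier;

// Sketch of the proposed fix: SessionExpired gets its own catch branch
// that reconnects and retries, instead of falling into the generic
// KeeperException branch that only logs and gives up.
class ZkWrapperSketch {
  // Hypothetical stand-ins for the real ZooKeeper exception types.
  static class KeeperException extends Exception {}
  static class SessionExpiredException extends KeeperException {}

  // Stand-in for the ZooKeeper client handle.
  interface ZkClient {
    void create(String znode) throws KeeperException;
  }

  private ZkClient zk;
  private final Supplier<ZkClient> reconnector; // builds a fresh handle

  ZkWrapperSketch(ZkClient zk, Supplier<ZkClient> reconnector) {
    this.zk = zk;
    this.reconnector = reconnector;
  }

  boolean ensureExists(String znode) {
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        zk.create(znode);
        return true;
      } catch (SessionExpiredException e) {
        // The session is gone for good: reopen the handle, then retry once.
        zk = reconnector.get();
      } catch (KeeperException e) {
        return false;
      }
    }
    return false; // the reconnected session expired too
  }
}
```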
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.