The issue I originally complained about was from the _client's_ point of view, and the client doesn't actually create ephemeral nodes.
But the other problems stand.

On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <[email protected]> wrote:

>
> [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ]
>
> Nitay Joffe commented on HBASE-1232:
> ------------------------------------
>
> When a SessionExpired occurs we will lose our ephemeral nodes. This means
> everyone else in the cluster will think that node is down. To fix this we
> need to restart the node completely.
>
> For example, if the master's connection to ZooKeeper throws SessionExpired
> it loses its ephemeral address node in ZooKeeper and everyone will think the
> master has died. In fact, another master may come up now that we have the HA
> master lock.
>
> I'm writing the #restart() methods for HMaster and HRegionServer.
> Effectively it's just something like:
>
> {code}
> shutdown();
> run();
> {code}
>
> I notice that the shutdown/stop methods in those classes just set a flag
> which is later picked up and causes a shutdown. How do I make sure the
> server is actually shutdown between the shutdown() call and the run() call?
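
On the shutdown()/run() question: a rough sketch of one way to do it, assuming the main loop runs on its own thread so the restart can join() that thread before starting again. FlagStoppedServer and every method on it are made up for illustration; this is not the HMaster or HRegionServer code, just the same "stop only sets a flag" shape.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a server whose stop() only sets a flag. */
class FlagStoppedServer implements Runnable {
    private volatile boolean stopRequested;
    private volatile Thread worker;
    private final AtomicInteger starts = new AtomicInteger();

    /** Main loop; notices the flag some time after stop() was called. */
    @Override
    public void run() {
        while (!stopRequested) {
            try {
                Thread.sleep(10); // placeholder for real work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Clears the flag and runs the main loop on a fresh thread. */
    void start() {
        stopRequested = false;
        starts.incrementAndGet();
        Thread t = new Thread(this, "server-main");
        worker = t;
        t.start();
    }

    /** Only requests shutdown; returns before the loop has exited. */
    void stop() {
        stopRequested = true;
    }

    /** Requests shutdown, waits for the old loop to finish, then starts again. */
    void restart() {
        Thread old = worker;
        if (old == Thread.currentThread()) {
            throw new IllegalStateException("restart() would wait on its own thread");
        }
        stop();
        try {
            if (old != null) {
                old.join(); // returns only once the old run() has returned
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for shutdown", e);
        }
        start();
    }

    Thread workerThread() { return worker; }

    int startCount() { return starts.get(); }
}
```

The catch is that restart() has to be called from some thread other than the server's own main loop, otherwise the join() would wait on itself; the sketch throws in that case instead of hanging.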
>
> > zookeeper client wont reconnect if there is a problem
> > -----------------------------------------------------
> >
> >                 Key: HBASE-1232
> >                 URL: https://issues.apache.org/jira/browse/HBASE-1232
> >             Project: Hadoop HBase
> >          Issue Type: Bug
> >         Environment: java 1.7, zookeeper 3.0.1
> >            Reporter: ryan rawson
> >            Assignee: Nitay Joffe
> >            Priority: Critical
> >             Fix For: 0.20.0
> >
> >
> > my regionserver got wedged:
> > 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
> >         at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
> >         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
> >         at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
> >         at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
> >         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
> >         at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
> >         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
> >         at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
> >         at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
> >         at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
> >         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
> >         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> > this message repeats over and over.
> > Looking at the code in question:
> >   private boolean ensureExists(final String znode) {
> >     try {
> >       zooKeeper.create(znode, new byte[0],
> >                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> >       LOG.debug("Created ZNode " + znode);
> >       return true;
> >     } catch (KeeperException.NodeExistsException e) {
> >       return true;      // ok, move on.
> >     } catch (KeeperException.NoNodeException e) {
> >       return ensureParentExists(znode) && ensureExists(znode);
> >     } catch (KeeperException e) {
> >       LOG.warn("Failed to create " + znode + ":", e);
> >     } catch (InterruptedException e) {
> >       LOG.warn("Failed to create " + znode + ":", e);
> >     }
> >     return false;
> >   }
> > We need to catch this exception specifically and reopen the ZK connection.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
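
For the client side, where there are no ephemeral nodes to lose, "catch it specifically and reopen" could look roughly like the sketch below. To keep it self-contained it does not use the real ZooKeeper classes: SessionExpired, ZkSession, ZkConnector and ReconnectingWrapper are all hypothetical stand-ins, and the retry limit is an assumption. The one fact it relies on is that an expired session handle cannot be reused, so a new one has to be opened.

```java
/** Hypothetical stand-in for KeeperException.SessionExpiredException. */
class SessionExpired extends Exception {
    private static final long serialVersionUID = 1L;
}

/** Hypothetical, minimal view of a ZooKeeper handle. */
interface ZkSession {
    void createPersistent(String znode) throws SessionExpired;
}

/** Hypothetical factory that opens a brand-new session. */
interface ZkConnector {
    ZkSession connect();
}

/** Sketch of a wrapper that reopens the session when it has expired. */
class ReconnectingWrapper {
    private final ZkConnector connector;
    private final int maxAttempts; // assumed limit so a dead quorum cannot loop forever
    private ZkSession session;

    ReconnectingWrapper(ZkConnector connector, int maxAttempts) {
        this.connector = connector;
        this.maxAttempts = maxAttempts;
        this.session = connector.connect();
    }

    /** Tries to create the znode, replacing the session each time it has expired. */
    boolean ensureExists(String znode) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                session.createPersistent(znode);
                return true;
            } catch (SessionExpired e) {
                // The old handle is dead for good; swap in a fresh one and retry.
                session = connector.connect();
            }
        }
        return false; // gave up after maxAttempts expired sessions
    }
}
```

The NodeExists and NoNode handling from the original method is left out here to keep the reconnect path visible. This only covers the client case; a master or regionserver that held ephemeral nodes would still need the full restart discussed above.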
