Yeah, I'm handling all three cases (Master, RegionServer, Client) in the same code. We could just let the Master/RegionServer fail on the SessionExpired and have the user clean it up, but that seems ugly since it is something we can handle.
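A shared recovery path for all three roles might look like the following minimal sketch. The ExpiryRecovery and Restartable names are illustrative stand-ins, not actual HBase classes; in the real code, onSessionExpired() would be driven by a ZooKeeper Watcher seeing KeeperState.Expired:

```java
// Sketch: handle SessionExpired in one shared code path for Master,
// RegionServer, and Client. Each role registers a restart callback with a
// single recovery handler instead of failing and making the user clean up.
// ExpiryRecovery and Restartable are hypothetical names, not HBase classes.
import java.util.ArrayList;
import java.util.List;

public class ExpiryRecovery {
    /** Anything that can rebuild its ZooKeeper state from scratch. */
    public interface Restartable {
        void restart();
    }

    private final List<Restartable> nodes = new ArrayList<>();

    public void register(Restartable r) {
        nodes.add(r);
    }

    /** Invoked when the ZK session expires: the same path for every role. */
    public void onSessionExpired() {
        for (Restartable r : nodes) {
            r.restart();
        }
    }
}
```

Each role then differs only in what its restart() does (the master re-takes the HA lock and re-creates its address node, the region server re-registers, the client just reconnects), while the expiry detection lives in one place.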
On Mon, Mar 23, 2009 at 11:10 PM, Ryan Rawson <[email protected]> wrote:
> My issue I originally complained about was from the _clients_ point of view,
> who doesn't actually create ephemeral nodes.
>
> But the other problems stand.
>
> On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <[email protected]> wrote:
> >
> > [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ]
> >
> > Nitay Joffe commented on HBASE-1232:
> > ------------------------------------
> >
> > When a SessionExpired occurs we will lose our ephemeral nodes. This means
> > everyone else in the cluster will think that node is down. To fix this we
> > need to restart the node completely.
> >
> > For example, if the master's connection to ZooKeeper throws SessionExpired,
> > it loses its ephemeral address node in ZooKeeper and everyone will think the
> > master has died. In fact, another master may come up now that we have the HA
> > master lock.
> >
> > I'm writing the #restart() methods for HMaster and HRegionServer.
> > Effectively it's just something like:
> >
> > {code}
> > shutdown();
> > run();
> > {code}
> >
> > I notice that the shutdown/stop methods in those classes just set a flag
> > which is later picked up and causes a shutdown. How do I make sure the
> > server is actually shut down between the shutdown() call and the run()
> > call?
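Since shutdown() only sets a flag that the main loop picks up later, one way to guarantee the server has actually stopped before run() is to join() the server thread between the two calls. A minimal pure-Java sketch of that ordering (class and member names here are illustrative, not the actual HMaster/HRegionServer API):

```java
// Sketch: restart() waits for the worker thread to really exit before
// rerunning. RestartableServer, stopRequested, and generation are
// hypothetical names used to model the flag-based shutdown in HBase.
public class RestartableServer {
    private volatile boolean stopRequested = false;
    private volatile int generation = 0;  // how many times the loop started
    private Thread worker;

    /** Starts the main server loop, which runs until it sees the flag. */
    public synchronized void start() {
        stopRequested = false;
        generation++;
        worker = new Thread(() -> {
            while (!stopRequested) {
                try {
                    Thread.sleep(10);  // stands in for real server work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.start();
    }

    /** Like HBase's stop methods: only sets a flag, returns immediately. */
    public void shutdown() {
        stopRequested = true;
    }

    /** Stop, WAIT for the old loop to finish, and only then start again. */
    public void restart() throws InterruptedException {
        shutdown();
        worker.join();  // blocks until the run loop has actually exited
        start();
    }

    public int generation() { return generation; }

    public boolean isRunning() { return worker != null && worker.isAlive(); }
}
```

The join() is the key step: without it, a fresh run() can race against the old loop, which may not yet have noticed the flag.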
> > > zookeeper client wont reconnect if there is a problem
> > > -----------------------------------------------------
> > >
> > >         Key: HBASE-1232
> > >         URL: https://issues.apache.org/jira/browse/HBASE-1232
> > >     Project: Hadoop HBase
> > >  Issue Type: Bug
> > > Environment: java 1.7, zookeeper 3.0.1
> > >    Reporter: ryan rawson
> > >    Assignee: Nitay Joffe
> > >    Priority: Critical
> > >     Fix For: 0.20.0
> > >
> > > my regionserver got wedged:
> > >
> > > 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
> > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
> > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
> > >     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
> > >     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
> > >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
> > >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
> > >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
> > >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
> > >     at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
> > >     at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
> > >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
> > >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> > >
> > > this message repeats over and over.
> > >
> > > Looking at the code in question:
> > >
> > > private boolean ensureExists(final String znode) {
> > >   try {
> > >     zooKeeper.create(znode, new byte[0],
> > >       Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> > >     LOG.debug("Created ZNode " + znode);
> > >     return true;
> > >   } catch (KeeperException.NodeExistsException e) {
> > >     return true; // ok, move on.
> > >   } catch (KeeperException.NoNodeException e) {
> > >     return ensureParentExists(znode) && ensureExists(znode);
> > >   } catch (KeeperException e) {
> > >     LOG.warn("Failed to create " + znode + ":", e);
> > >   } catch (InterruptedException e) {
> > >     LOG.warn("Failed to create " + znode + ":", e);
> > >   }
> > >   return false;
> > > }
> > >
> > > We need to catch this exception specifically and reopen the ZK connection.
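The proposed fix, catching SessionExpiredException in its own clause that reopens the ZooKeeper handle and retries, rather than falling through to the generic KeeperException branch that only logs and returns false, can be sketched as follows. The Conn, Reconnector, and SessionExpired types are stand-ins for the real org.apache.zookeeper.ZooKeeper handle and KeeperException$SessionExpiredException, used only so the sketch is self-contained:

```java
// Sketch of the proposed fix: treat session expiry as a distinct case that
// reopens the connection and retries the create, instead of just logging it.
// Conn, Reconnector, and SessionExpired are hypothetical stand-ins for the
// actual ZooKeeper client classes.
public class EnsureExists {
    public static class SessionExpired extends Exception {}

    public interface Conn {
        // Returns true if the znode was created or already exists.
        boolean create(String znode) throws SessionExpired;
    }

    public interface Reconnector {
        Conn reconnect();  // opens a brand-new ZooKeeper session
    }

    public static boolean ensureExists(Conn zk, Reconnector rc, String znode) {
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                return zk.create(znode);
            } catch (SessionExpired e) {
                // The old session is unrecoverable; get a fresh handle
                // and try once more.
                zk = rc.reconnect();
            }
        }
        return false;  // expired again on the retry; give up
    }
}
```

Note that for the Master and RegionServer a reconnect alone is not enough: the ephemeral nodes owned by the expired session are gone, so they must be re-registered on the new session, which is what the #restart() work discussed above is for.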
> >
> > --
> > This message is automatically generated by JIRA.
> > You can reply to this email to add a comment to the issue online.
