Yeah, I'm handling all three cases (Master, RegionServer, Client) in the same code. We could just let the Master/RegionServer fail on the SessionExpired and have the user clean it up, but that seems ugly since it is something we can handle.
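A shared recovery path for all three roles might look like the following minimal sketch. The ExpiryRecovery and Restartable names are illustrative stand-ins, not actual HBase classes; in the real code, onSessionExpired() would be driven by a ZooKeeper Watcher seeing KeeperState.Expired:

```java
// Sketch: handle SessionExpired in one shared code path for Master,
// RegionServer, and Client. Each role registers a restart callback with a
// single recovery handler instead of failing and making the user clean up.
// ExpiryRecovery and Restartable are hypothetical names, not HBase classes.
import java.util.ArrayList;
import java.util.List;

public class ExpiryRecovery {
    /** Anything that can rebuild its ZooKeeper state from scratch. */
    public interface Restartable {
        void restart();
    }

    private final List<Restartable> nodes = new ArrayList<>();

    public void register(Restartable r) {
        nodes.add(r);
    }

    /** Invoked when the ZK session expires: the same path for every role. */
    public void onSessionExpired() {
        for (Restartable r : nodes) {
            r.restart();
        }
    }
}
```

Each role then differs only in what its restart() does (the master re-takes the HA lock and re-creates its address node, the region server re-registers, the client just reconnects), while the expiry detection lives in one place.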
On Mon, Mar 23, 2009 at 11:10 PM, Ryan Rawson <[email protected]> wrote:
> My issue I originally complained about was from the _clients_ point of view,
> who doesn't actually create ephemeral nodes.
>
> But the other problems stand.
>
> On Mon, Mar 23, 2009 at 11:01 PM, Nitay Joffe (JIRA) <[email protected]> wrote:
> >
> > [ https://issues.apache.org/jira/browse/HBASE-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688578#action_12688578 ]
> >
> > Nitay Joffe commented on HBASE-1232:
> > ------------------------------------
> >
> > When a SessionExpired occurs we will lose our ephemeral nodes. This means
> > everyone else in the cluster will think that node is down. To fix this we
> > need to restart the node completely.
> >
> > For example, if the master's connection to ZooKeeper throws SessionExpired,
> > it loses its ephemeral address node in ZooKeeper and everyone will think the
> > master has died. In fact, another master may come up now that we have the HA
> > master lock.
> >
> > I'm writing the #restart() methods for HMaster and HRegionServer.
> > Effectively it's just something like:
> >
> > {code}
> > shutdown();
> > run();
> > {code}
> >
> > I notice that the shutdown/stop methods in those classes just set a flag
> > which is later picked up and causes a shutdown. How do I make sure the
> > server is actually shut down between the shutdown() call and the run()
> > call?
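Since shutdown() only sets a flag that the main loop picks up later, one way to guarantee the server has actually stopped before run() is to join() the server thread between the two calls. A minimal pure-Java sketch of that ordering (class and member names here are illustrative, not the actual HMaster/HRegionServer API):

```java
// Sketch: restart() waits for the worker thread to really exit before
// rerunning. RestartableServer, stopRequested, and generation are
// hypothetical names used to model the flag-based shutdown in HBase.
public class RestartableServer {
    private volatile boolean stopRequested = false;
    private volatile int generation = 0;  // how many times the loop started
    private Thread worker;

    /** Starts the main server loop, which runs until it sees the flag. */
    public synchronized void start() {
        stopRequested = false;
        generation++;
        worker = new Thread(() -> {
            while (!stopRequested) {
                try {
                    Thread.sleep(10);  // stands in for real server work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });
        worker.start();
    }

    /** Like HBase's stop methods: only sets a flag, returns immediately. */
    public void shutdown() {
        stopRequested = true;
    }

    /** Stop, WAIT for the old loop to finish, and only then start again. */
    public void restart() throws InterruptedException {
        shutdown();
        worker.join();  // blocks until the run loop has actually exited
        start();
    }

    public int generation() { return generation; }

    public boolean isRunning() { return worker != null && worker.isAlive(); }
}
```

The join() is the key step: without it, a fresh run() can race against the old loop, which may not yet have noticed the flag.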
> > > zookeeper client wont reconnect if there is a problem
> > > -----------------------------------------------------
> > >
> > >         Key: HBASE-1232
> > >         URL: https://issues.apache.org/jira/browse/HBASE-1232
> > >     Project: Hadoop HBase
> > >  Issue Type: Bug
> > > Environment: java 1.7, zookeeper 3.0.1
> > >    Reporter: ryan rawson
> > >    Assignee: Nitay Joffe
> > >    Priority: Critical
> > >     Fix For: 0.20.0
> > >
> > > my regionserver got wedged:
> > >
> > > 2009-03-02 15:43:30,938 WARN org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Failed to create /hbase:
> > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase
> > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:87)
> > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:35)
> > >     at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:482)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureExists(ZooKeeperWrapper.java:219)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.ensureParentExists(ZooKeeperWrapper.java:240)
> > >     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.checkOutOfSafeMode(ZooKeeperWrapper.java:328)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:783)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:468)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:443)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:518)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:477)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:450)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:295)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:919)
> > >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:950)
> > >     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1370)
> > >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1314)
> > >     at org.apache.hadoop.hbase.client.HTable.commit(HTable.java:1294)
> > >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:237)
> > >     at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:216)
> > >     at org.apache.hadoop.hbase.RegionHistorian.addRegionSplit(RegionHistorian.java:174)
> > >     at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:607)
> > >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:174)
> > >     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:107)
> > >
> > > this message repeats over and over.
> > >
> > > Looking at the code in question:
> > >
> > > private boolean ensureExists(final String znode) {
> > >   try {
> > >     zooKeeper.create(znode, new byte[0],
> > >       Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
> > >     LOG.debug("Created ZNode " + znode);
> > >     return true;
> > >   } catch (KeeperException.NodeExistsException e) {
> > >     return true; // ok, move on.
> > >   } catch (KeeperException.NoNodeException e) {
> > >     return ensureParentExists(znode) && ensureExists(znode);
> > >   } catch (KeeperException e) {
> > >     LOG.warn("Failed to create " + znode + ":", e);
> > >   } catch (InterruptedException e) {
> > >     LOG.warn("Failed to create " + znode + ":", e);
> > >   }
> > >   return false;
> > > }
> > >
> > > We need to catch this exception specifically and reopen the ZK connection.
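The proposed fix, catching SessionExpiredException in its own clause that reopens the ZooKeeper handle and retries, rather than falling through to the generic KeeperException branch that only logs and returns false, can be sketched as follows. The Conn, Reconnector, and SessionExpired types are stand-ins for the real org.apache.zookeeper.ZooKeeper handle and KeeperException$SessionExpiredException, used only so the sketch is self-contained:

```java
// Sketch of the proposed fix: treat session expiry as a distinct case that
// reopens the connection and retries the create, instead of just logging it.
// Conn, Reconnector, and SessionExpired are hypothetical stand-ins for the
// actual ZooKeeper client classes.
public class EnsureExists {
    public static class SessionExpired extends Exception {}

    public interface Conn {
        // Returns true if the znode was created or already exists.
        boolean create(String znode) throws SessionExpired;
    }

    public interface Reconnector {
        Conn reconnect();  // opens a brand-new ZooKeeper session
    }

    public static boolean ensureExists(Conn zk, Reconnector rc, String znode) {
        for (int attempt = 0; attempt < 2; attempt++) {
            try {
                return zk.create(znode);
            } catch (SessionExpired e) {
                // The old session is unrecoverable; get a fresh handle
                // and try once more.
                zk = rc.reconnect();
            }
        }
        return false;  // expired again on the retry; give up
    }
}
```

Note that for the Master and RegionServer a reconnect alone is not enough: the ephemeral nodes owned by the expired session are gone, so they must be re-registered on the new session, which is what the #restart() work discussed above is for.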
> >
> > --
> > This message is automatically generated by JIRA.
> > You can reply to this email to add a comment to the issue online.
