There are two ways to lose your ZK session:

- You don't send pings back to ZK, so it expires the session (GC pause of death, network disconnect, etc.).
- ZK "somehow" expires your session for you. I have seen this once in a while; it's rare, but painful when it happens. It didn't seem to be correlated with a GC pause at the time.
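To make the first case concrete, here is a toy model of timeout-based expiry. It is a sketch only: `SessionExpirySketch`, `sessionExpired`, and the 40-second timeout are made up for illustration and are not part of the ZooKeeper client API. It also says nothing about the second, unexplained case.

```java
import java.time.Duration;

/** Toy model of timeout-based session expiry; NOT the real ZooKeeper API. */
final class SessionExpirySketch {

    /**
     * Returns true when the silence since the last ping is longer than the
     * session timeout, i.e. when the server would give up on the session.
     */
    static boolean sessionExpired(Duration sinceLastPing, Duration sessionTimeout) {
        return sinceLastPing.compareTo(sessionTimeout) > 0;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofSeconds(40); // hypothetical timeout value

        // A short pause: pings resume well inside the timeout.
        System.out.println(sessionExpired(Duration.ofSeconds(5), timeout));  // prints false

        // A "GC pause of death" or network disconnect: no pings for 90 seconds.
        System.out.println(sessionExpired(Duration.ofSeconds(90), timeout)); // prints true
    }
}
```

The point of the model is only that whatever stops the pings for longer than the timeout (GC, network, anything else) looks the same from the ZK side.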
So here is the proposal in full:

- RegionServerWatcher (RSW) starts the ZK pingback, and exists to listen for termination notifications from the RegionServer (via good old-fashioned OS primitives).
- RSW keeps the ZK node up. It keeps tabs on its child, perhaps by checking ports or whatnot.
- If the RS dies, RSW kills the ZK ephemeral node. No race conditions, because the log append terminates before the master takes action (which it does only after the ZK notification comes in).
- If an RS goes into a long GC pause, the RSW can decide to wait it out, or kill -9 the RS and release the HLog. Again no race condition, for the previous reason.
- If a network outage takes the node out, this is where a race condition could occur. In which case, Option #1 seems super clean and awesome. It also has the advantage of being really easy to understand (always a plus at 2am).

The overall advantage of my proposal is that we can tune the ZK timeout down to something really small, like 10 seconds. That way, when network events take a node out of service, we can detect and respond much faster. Also, with a separate process we now have the ability to react instantly to crashes without waiting for a timeout.

A disadvantage is more moving parts, but we can probably abstract this away cleanly.

One last thought - if we have a 10 second timeout and we have a network partition, we will see a cascade of failed regionservers. Considering that the individual RS may not be able to proceed anyway (it might have been cut off from too many datanodes to log or read hfiles), this might be inevitable.

Obviously this means running HBase across a WAN is right out (we always knew that, right?), but this is why we are doing replication.

On Wed, Mar 17, 2010 at 10:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
> On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> I have a 4th option :-) I'm on the his right now and I'll write it up when
>> I get to work.
>> In short move the zk thread out of the rs into a monitoring
>> parent and then you can explicitly monitor for Juliet gc pauses. More to
>> come....
>>
>
> I don't think that will be correct - it might be mostly correct, but "Juliet
> gc pauses" are just an extra long version of what happens all the time. ZK
> is asynchronous, so we will never find out immediately if we've been killed.
> There can always be an arbitrarily long pause in between looking at ZK state
> and taking an action.
>
> -Todd
>
>
>>
>> On Mar 17, 2010 10:22 AM, "Karthik Ranganathan" <kranganat...@facebook.com>
>> wrote:
>>
>> Loved the "Juliet" terminology as well :).
>>
>> @Todd: I agree we will need something like #2 or especially #3 in other
>> places.
>>
>> Looks like we have a consensus - I will update the JIRA.
>>
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Todd Lipcon [mailto:t...@cloudera.com]
>>
>> Sent: Tuesday, March 16, 2010 10:09 PM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: HBASE-2312 discu...
>>
>> On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:
>>
>> > On Tue, Mar 16, 2010 at 5:08 PM,...
>> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
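For concreteness, the watcher behaviour proposed at the top of this message could be sketched roughly as below. This is only an illustration under stated assumptions: `RegionServerWatcherSketch`, `Action`, `decide`, and the `maxTolerablePause` threshold are hypothetical names, not existing HBase or ZooKeeper code, and the network-outage case (where the race can still occur) is deliberately left out.

```java
import java.time.Duration;

/** Hypothetical decision logic for the proposed watcher; not existing HBase code. */
final class RegionServerWatcherSketch {

    enum Action {
        KEEP_NODE,                // RS is healthy: keep the ZK ephemeral node up
        WAIT_OUT_PAUSE,           // RS is paused (e.g. GC) but within tolerance
        KILL_RS_THEN_DELETE_NODE, // kill -9 the RS first, then release its node
        DELETE_NODE               // RS already died: just remove the node
    }

    /**
     * @param rsProcessAlive    whether the child RS process still exists (OS-level check)
     * @param unresponsiveFor   how long the RS has failed the watcher's health checks
     * @param maxTolerablePause assumed tunable: longest pause the watcher will wait out
     */
    static Action decide(boolean rsProcessAlive, Duration unresponsiveFor,
                         Duration maxTolerablePause) {
        if (!rsProcessAlive) {
            // Log appends have already stopped, so the master can safely take over.
            return Action.DELETE_NODE;
        }
        if (unresponsiveFor.isZero()) {
            return Action.KEEP_NODE;
        }
        if (unresponsiveFor.compareTo(maxTolerablePause) <= 0) {
            return Action.WAIT_OUT_PAUSE;
        }
        // Ordering matters: the RS must be dead before the node disappears,
        // otherwise the master could act while the RS is still appending.
        return Action.KILL_RS_THEN_DELETE_NODE;
    }
}
```

The only design point the sketch tries to capture is the ordering: the node is removed only once the RS can no longer append to its log, which is what keeps the crash and GC-pause cases free of races.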