Hi Ryan,

I think the idea of a secondary watchdog node is a decent one, but as you
mentioned, it isn't a solution for the problem at hand. The GC pause
exacerbates the problem, but network blips, etc., can cause the same problem.
Is there a JIRA open for the watchdog process? I think we should discuss it
separately. A few weeks ago I had proposed on IRC the ridiculously named
"SeppukuNode", which is a similar but not quite the same idea - we should hash
those out on JIRA.

-Todd

On Wed, Mar 17, 2010 at 11:38 AM, Ryan Rawson <ryano...@gmail.com> wrote:
> There are 2 ways to lose your ZK session:
>
> - you don't send pings back to ZK and it expires your session (GC pause of
>   death, network disconnect, etc.)
> - ZK "somehow" expires your session for you. I have seen this once in a
>   while; it's rare, but painful when it happens. It didn't seem to be
>   correlated to GC pauses at the time.
>
> So here is the proposal in full:
> - RegionServerWatcher (RSW) starts the ZK pingback and exists to listen for
>   termination notifications from the RegionServer (via good old-fashioned
>   OS primitives).
> - RSW keeps the ZK node up and keeps tabs on its child, perhaps checking
>   ports, or whatnot.
> - If the RS dies, RSW kills the ZK ephemeral node. No race conditions,
>   because the log append terminates before the master takes action (which
>   it does only after the ZK notification comes in).
> - If an RS goes into a long GC pause, the RSW can decide to wait it out or
>   kill -9 the RS and release the HLog. Again, no race condition, for the
>   previous reason.
> - If a network outage takes the node out, this is where a race condition
>   could occur. In that case, option #1 seems super clean and awesome. It
>   also has the advantage of being really easy to understand (always a plus
>   at 2am).
>
> The overall advantage of my proposal is that we can tune the ZK timeout
> down to something really small, like 10 seconds. That way, when network
> events take a node out of service, we can detect and respond much faster.
> Also, with a separate process we now have the ability to react instantly
> to crashes without waiting for a timeout. A disadvantage is more moving
> parts, but we can probably abstract this away cleanly.
>
> One last thought - if we have a 10-second timeout and a network partition,
> we will see a cascade of failed regionservers. Considering that the
> individual RSes may not be able to proceed anyway (they might have been
> cut off from too many datanodes to log or read hfiles), this might be
> inevitable. Obviously this means running HBase across a WAN is right out
> (we always knew that, right?), but this is why we are doing replication.
>
> On Wed, Mar 17, 2010 at 10:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
> > On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <ryano...@gmail.com>
> > wrote:
> >
> >> I have a 4th option :-) I'm on the bus right now and I'll write it up
> >> when I get to work. In short: move the ZK thread out of the RS into a
> >> monitoring parent, and then you can explicitly monitor for Juliet GC
> >> pauses. More to come....
> >>
> >
> > I don't think that will be correct - it might be mostly correct, but
> > "Juliet GC pauses" are just an extra-long version of what happens all
> > the time. ZK is asynchronous, so we will never find out immediately if
> > we've been killed. There can always be an arbitrarily long pause between
> > looking at ZK state and taking an action.
> >
> > -Todd
> >
> >
> >> On Mar 17, 2010 10:22 AM, "Karthik Ranganathan"
> >> <kranganat...@facebook.com> wrote:
> >>
> >> Loved the "Juliet" terminology as well :).
> >>
> >> @Todd: I agree we will need something like #2 or especially #3 in other
> >> places.
> >>
> >> Looks like we have a consensus - I will update the JIRA.
> >>
> >> Thanks
> >> Karthik
> >>
> >>
> >> -----Original Message-----
> >> From: Todd Lipcon [mailto:t...@cloudera.com]
> >> Sent: Tuesday, March 16, 2010 10:09 PM
> >> To: hbase-dev@hadoop.apache.org
> >> Subject: Re: HBASE-2312 discu...
> >>
> >> On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:
> >>
> >> > On Tue, Mar 16, 2010 at 5:08 PM,...
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera


--
Todd Lipcon
Software Engineer, Cloudera
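
For readers following the thread, here is a minimal sketch of the
parent-watchdog shape Ryan describes above. This is illustrative only, not
HBase code: the class name, znode path, ZK connect string, and child command
line are all invented, error handling is omitted, and the "wait out or kill
-9 a GC-paused child" logic is left out.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class RegionServerWatcher {
  public static void main(String[] args) throws Exception {
    // The watchdog, not the regionserver, owns the ZK session, so a GC
    // pause in the regionserver JVM cannot cause missed pings. That is
    // what makes a short session timeout (e.g. 10s) safe to consider.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, event -> { });

    // Hypothetical znode path; stands in for wherever the RS would
    // normally register itself.
    String znode = "/hbase/rs/"
        + java.net.InetAddress.getLocalHost().getHostName();
    zk.create(znode, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // Launch the regionserver as a child process and watch it with plain
    // OS primitives: waitFor() blocks until the child exits.
    Process rs = new ProcessBuilder("hbase", "regionserver", "start")
        .inheritIO()
        .start();
    int exit = rs.waitFor();

    // Child is gone: delete the ephemeral node explicitly so the master
    // can react at once instead of waiting for the session to expire.
    zk.delete(znode, -1);
    zk.close();
    System.exit(exit);
  }
}

The point of the sketch is the division of labor: the ZK heartbeats come
from the watchdog's JVM, and waitFor() returns the moment the child dies,
so the ephemeral node can be removed immediately rather than after a
timeout. Note it does nothing about Todd's objection - there is still an
arbitrarily long gap between observing ZK state and acting on it.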