I opened HBASE-2342 to discuss the watchdog node concept. -Todd
On Wed, Mar 17, 2010 at 2:59 PM, Todd Lipcon <t...@cloudera.com> wrote:
> Hi Ryan,
>
> I think the idea of a secondary watchdog node is a decent one, but as you
> mentioned, it isn't a solution for the problem at hand. The GC pause
> exacerbates the problem, but network blips, etc., can cause the same problem.
>
> Is there a JIRA open for the watchdog process? I think we should discuss it
> separately. A few weeks ago I had proposed on IRC the ridiculously named
> "SeppukuNode", which is a similar but not quite the same idea - we should
> hash those out on JIRA.
>
> -Todd
>
>
> On Wed, Mar 17, 2010 at 11:38 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> There are 2 ways to lose your ZK session:
>>
>> - You don't send pings back to ZK and it expires the session (GC pause of
>> death, network disconnect, etc.)
>> - ZK "somehow" expires your session for you. I have seen this once in
>> a while; it's rare, but painful when it happens. It didn't seem to be
>> correlated to GC pause at the time.
>>
>> So here is the proposal in full:
>> - RegionServerWatcher (RSW) starts the ZK pingback, and exists to listen
>> for termination notifications from the RegionServer (via good old
>> fashioned OS primitives).
>> - RSW keeps the ZK node up, and keeps tabs on its child, perhaps checking
>> ports, or whatnot.
>> - If the RS dies, RSW kills the ZK ephemeral node. No race conditions,
>> because the log append terminates before the master takes action
>> (which it does only after the ZK notification comes in).
>> - If an RS goes into a long GC pause, the RSW can decide to wait it out
>> or kill -9 the RS and release the HLog. Again, no race condition, for
>> the previous reason.
>> - If a network outage takes the node out, this is where a race
>> condition could occur. In which case, Option #1 seems super clean and
>> awesome. It also has the advantage of being really easy to understand
>> (always a plus at 2am).
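[Editor's note: the watchdog protocol Ryan outlines above can be sketched in a few lines. This is a hypothetical illustration only, with the ZooKeeper client stubbed out so it stays self-contained; none of these names are real HBase or ZooKeeper APIs.]

```python
# Sketch of the proposed RegionServerWatcher: a parent process that
# launches the RegionServer as a child, owns the ZK session on its
# behalf, and deletes the ephemeral node the instant the child exits.
import subprocess

class FakeZkSession:
    """Stand-in for a real ZooKeeper client holding an ephemeral znode."""
    def __init__(self):
        self.node_up = False

    def create_ephemeral(self):
        self.node_up = True    # e.g. advertise the RS under /hbase/rs/...

    def delete_ephemeral(self):
        self.node_up = False   # master sees the delete and starts recovery

def watch_region_server(cmd, zk):
    """Run the RS command; drop the ephemeral node when it dies."""
    zk.create_ephemeral()          # mark the RS as live
    rs = subprocess.Popen(cmd)     # fork the RegionServer
    try:
        return rs.wait()           # plain OS primitive, no ZK heartbeat
                                   # needed inside the RS process itself
    finally:
        # The RS exited (crash or clean shutdown): delete the node now,
        # so the master can split logs immediately instead of waiting
        # out a session timeout.
        zk.delete_ephemeral()
```

The key property of the design is visible in the `finally` block: because the watchdog is a separate process, a crash of the RS is observed via `wait()` and reported to ZK instantly, while a GC pause inside the RS cannot stall the session keep-alive.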
>>
>> The overall advantage of my proposal is we can tune down the ZK
>> timeout to something really small, like 10 seconds. That way, when
>> network events take a node out of service, we can detect and respond
>> much faster. Also, with a separate process we now have the ability to
>> react instantly to crashes without waiting for a timeout. A
>> disadvantage is more moving parts, but we can probably abstract this
>> away cleanly.
>>
>> One last thought - if we have a 10 second timeout and we have a
>> network partition, we will see a cascade of failed regionservers.
>> Considering that the individual RSs may not be able to proceed anyway
>> (they might have been cut off from too many datanodes to log or read
>> hfiles), this might be inevitable. Obviously this means running HBase
>> across a WAN is right out (we always knew that, right?), but this is
>> why we are doing replication.
>>
>> On Wed, Mar 17, 2010 at 10:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
>> > On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >
>> >> I have a 4th option :-) I'm on the bus right now and I'll write it up
>> >> when I get to work. In short: move the ZK thread out of the RS into a
>> >> monitoring parent, and then you can explicitly monitor for Juliet GC
>> >> pauses. More to come....
>> >>
>> >
>> > I don't think that will be correct - it might be mostly correct, but
>> > "Juliet GC pauses" are just an extra-long version of what happens all
>> > the time. ZK is asynchronous, so we will never find out immediately if
>> > we've been killed. There can always be an arbitrarily long pause in
>> > between looking at ZK state and taking an action.
>> >
>> > -Todd
>> >
>> >
>> >> On Mar 17, 2010 10:22 AM, "Karthik Ranganathan"
>> >> <kranganat...@facebook.com> wrote:
>> >>
>> >> Loved the "Juliet" terminology as well :).
>> >>
>> >> @Todd: I agree we will need something like #2 or especially #3 in other
>> >> places.
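[Editor's note: the timeout tuning Ryan mentions above would be an hbase-site.xml change along these lines. The property name is assumed from contemporary HBase configurations; treat the key and default as illustrative, not authoritative.]

```xml
<!-- Sketch only: shrink the ZK session timeout so a dead or partitioned
     RS is detected in ~10 seconds instead of the much larger default.
     Property name assumed; verify against your HBase version's docs. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>10000</value> <!-- milliseconds -->
</property>
```

As the thread notes, the trade-off is that a short timeout turns a brief network partition into a cascade of expired sessions, so the value has to be weighed against expected network behavior.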
>> >>
>> >> Looks like we have a consensus - I will update the JIRA.
>> >>
>> >> Thanks,
>> >> Karthik
>> >>
>> >> -----Original Message-----
>> >> From: Todd Lipcon [mailto:t...@cloudera.com]
>> >> Sent: Tuesday, March 16, 2010 10:09 PM
>> >> To: hbase-dev@hadoop.apache.org
>> >> Subject: Re: HBASE-2312 discu...
>> >>
>> >> On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:
>> >>
>> >> > On Tue, Mar 16, 2010 at 5:08 PM,...
>> >>
>> >

--
Todd Lipcon
Software Engineer, Cloudera