I opened HBASE-2342 to discuss the watchdog node concept.

-Todd

On Wed, Mar 17, 2010 at 2:59 PM, Todd Lipcon <t...@cloudera.com> wrote:

> Hi Ryan,
>
> I think the idea of a secondary watchdog node is a decent one, but as you
> mentioned, it isn't a solution for the problem at hand. The GC pause
> exacerbates the problem, but network blips, etc., can cause the same problem.
>
> Is there a JIRA open for the watchdog process? I think we should discuss it
> separately. A few weeks ago I had proposed on IRC the ridiculously named
> "SeppukuNode", which is a similar but not quite the same idea - we should
> hash those out on JIRA.
>
> -Todd
>
>
> On Wed, Mar 17, 2010 at 11:38 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> There are 2 ways to lose your ZK session:
>>
>> - you don't send pings back to ZK and it expires it (GC pause of death,
>> network disconnect, etc.)
>> - ZK "somehow" expires your session for you. I have seen this once in
>> a while; it's rare, but painful when it happens. It didn't seem to be
>> correlated with GC pause at the time.
>>
>> So here is the proposal in full:
>> - RegionServerWatcher starts the ZK pingback, and exists to listen for
>> termination notifications from the RegionServer (via good old-fashioned OS
>> primitives).
>> - RSW keeps the ZK node up and keeps tabs on its child, perhaps checking
>> ports, or whatnot.
>> - If the RS dies, the RSW kills the ZK ephemeral node. No race conditions,
>> because the log append terminates before the master takes action
>> (which it does only after the ZK notification comes in).
>> - If an RS goes into a long GC pause, the RSW can decide to wait it out
>> or kill -9 the RS and release the HLog. Again, no race condition, for
>> the previous reason.
>> - If a network outage takes the node out, this is where a race
>> condition could occur. In that case, Option #1 seems super clean and
>> awesome. It also has the advantage of being really easy to understand
>> (always a plus at 2am).
>>
>> The overall advantage of my proposal is we can tune down the ZK
>> timeout to something really small.  Like 10 seconds. That way when
>> network events take a node out of service, we can detect and respond
>> much faster.  Also with a separate process we now have the ability to
>> react instantly to crashes without waiting for a timeout. A
>> disadvantage is more moving parts, but we can probably abstract this
>> away cleanly.
>>
>> One last thought - if we have a 10 second timeout and we have a
>> network partition, we will see a cascade of failed regionservers.
>> Considering that the individual RS may not be able to proceed anyway
>> (they might have been cut off from too many datanodes to log or read
>> hfiles), this might be inevitable.  Obviously this means running HBase
>> across a WAN is right out (we always knew that, right?), but this is
>> why we are doing replication.
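>>
A minimal sketch of the watcher idea above (illustrative only - this is not HBase code, and `run_watchdog` and its marker file are made-up stand-ins): a parent process owns the liveness marker (a temp file standing in for the ZK ephemeral node) while a child process stands in for the RegionServer. If the child exits, or blows past a deadline standing in for a GC pause, the parent kills it and drops the marker immediately rather than waiting for a session timeout to fire.

```python
import os
import subprocess
import sys
import tempfile

def run_watchdog(child_cmd, deadline=10.0):
    fd, marker = tempfile.mkstemp()  # stands in for the ephemeral ZK node
    os.close(fd)
    child = subprocess.Popen(child_cmd)
    try:
        # A real watcher would also be pinging ZK from here.
        child.wait(timeout=deadline)
    except subprocess.TimeoutExpired:
        # Child unresponsive past the deadline: kill -9 it and move on.
        child.kill()
        child.wait()
    finally:
        os.unlink(marker)  # release the "ephemeral node" right away
    return child.returncode

# A well-behaved child exits cleanly inside the deadline:
ok = run_watchdog([sys.executable, "-c", "pass"], deadline=10.0)

# A "paused" child gets killed once the deadline passes:
paused = run_watchdog([sys.executable, "-c", "import time; time.sleep(30)"],
                      deadline=0.5)
```

The point of the split is that the watcher's deadline logic keeps running even while the child is completely frozen, which is exactly what an in-process ZK ping thread cannot guarantee.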
>>
>> On Wed, Mar 17, 2010 at 10:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
>> > On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>> >
>> >> I have a 4th option :-)  I'm on the bus right now and I'll write it up
>> >> when I get to work. In short, move the ZK thread out of the RS into a
>> >> monitoring parent, and then you can explicitly monitor for Juliet GC
>> >> pauses. More to come....
>> >>
>> >
>> > I don't think that will be correct - it might be mostly correct, but
>> > "Juliet GC pauses" are just an extra-long version of what happens all
>> > the time. ZK is asynchronous, so we will never find out immediately if
>> > we've been killed. There can always be an arbitrarily long pause between
>> > looking at ZK state and taking an action.
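>> >
Todd's point is a classic check-then-act race, which can be sketched as follows (illustrative only; the flag and thread names are made up and do not correspond to anything in HBase or ZK):

```python
import threading
import time

session_alive = True        # stands in for the last-observed ZK session state
acted_on_dead_session = False

def watcher():
    global acted_on_dead_session
    if session_alive:                  # 1. observe ZK state
        time.sleep(0.5)                # 2. arbitrary pause (GC, scheduling, network)
        # 3. act - by now the observation may be stale
        acted_on_dead_session = not session_alive

def expirer():
    global session_alive
    time.sleep(0.05)
    session_alive = False              # session expires during the watcher's pause

t1 = threading.Thread(target=watcher)
t2 = threading.Thread(target=expirer)
t1.start(); t2.start()
t1.join(); t2.join()
```

Because an arbitrary delay can sit between step 1 and step 3, moving the check into a separate process shrinks the window but cannot close it.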
>> >
>> > -Todd
>> >
>> >
>> >>
>> >> On Mar 17, 2010 10:22 AM, "Karthik Ranganathan"
>> >> <kranganat...@facebook.com> wrote:
>> >>
>> >> Loved the "Juliet" terminology as well :).
>> >>
>> >> @Todd: I agree we will need something like #2 or especially #3 in other
>> >> places.
>> >>
>> >> Looks like we have a consensus - I will update the JIRA.
>> >>
>> >>
>> >> Thanks
>> >> Karthik
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Todd Lipcon [mailto:t...@cloudera.com]
>> >>
>> >> Sent: Tuesday, March 16, 2010 10:09 PM
>> >> To: hbase-dev@hadoop.apache.org
>> >> Subject: Re: HBASE-2312 discu...
>> >>
>> >> On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:
>> >>
>> >> > On Tue, Mar 16, 2010 at 5:08 PM,...
>> >>
>> >
>> >
>> >
>> > --
>> > Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera
