There are two ways to lose your ZK session:

- You don't send pings back to ZK, so it expires the session (GC pause of
death, network disconnect, etc.).
- ZK "somehow" expires your session for you. I have seen this once in
a while; it's rare, but painful when it happens. It didn't seem to be
correlated with a GC pause at the time.

So here is the proposal in full:
- RegionServerWatcher (RSW) starts the ZK pingback, and exists to listen
for termination notifications from RegionServer (via good old-fashioned
OS primitives).
- RSW keeps the ZK node up. It keeps tabs on its child, perhaps checking
ports, or whatnot.
- If the RS dies, RSW kills the ZK ephemeral node. No race conditions,
because the log append terminates before the master takes action
(which it does only after the ZK notification comes in).
- If an RS goes into a long GC pause, the RSW can decide to wait it out
or kill -9 the RS and release the HLog. Again, no race condition, for
the previous reason.
- If a network outage takes the node out, this is where a race
condition could occur.  In which case, Option #1 seems super clean and
awesome. It also has the advantage of being really easy to understand
(always a plus at 2am).
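
A minimal sketch of the watcher loop described in the list above, not a
real implementation. Everything named here is hypothetical: EphemeralNode
is an in-memory stand-in for the ZK ephemeral node (a real RSW would hold
an actual ZK session), is_responsive stands in for whatever health probe
the RSW uses (port check, etc.), and the znode path is made up. The point
it illustrates is the ordering: the node is released only after the OS
confirms the child is dead.

```python
import subprocess
import sys
import time


class EphemeralNode:
    """In-memory stand-in for the ZK ephemeral node (assumption, not a ZK API)."""

    def __init__(self, path):
        self.path = path
        self.alive = True

    def delete(self):
        # A real watcher would delete the znode or close its ZK session here.
        self.alive = False


def watch_region_server(cmd, node, is_responsive=lambda: True,
                        max_unresponsive=30.0, poll_interval=0.5):
    """Run the RS as a child; release the node only once the child is dead."""
    child = subprocess.Popen(cmd)
    unresponsive_since = None
    try:
        while True:
            try:
                # Returns as soon as the OS reports that the child exited.
                return child.wait(timeout=poll_interval)
            except subprocess.TimeoutExpired:
                pass
            if is_responsive():
                unresponsive_since = None
                continue
            now = time.monotonic()
            if unresponsive_since is None:
                unresponsive_since = now
            elif now - unresponsive_since >= max_unresponsive:
                # Waited long enough: the "kill -9" case (SIGKILL on POSIX).
                child.kill()
                return child.wait()
    finally:
        # Never release the node while the child could still be appending.
        if child.poll() is None:
            child.kill()
            child.wait()
        node.delete()


if __name__ == "__main__":
    node = EphemeralNode("/hbase/rs/example")  # hypothetical path
    code = watch_region_server([sys.executable, "-c", "import sys; sys.exit(3)"],
                               node, poll_interval=0.05)
    print(code, node.alive)
```

Setting max_unresponsive high is the "wait it out" choice; setting it low
is the "kill -9 and release the HLog" choice.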

The overall advantage of my proposal is we can tune down the ZK
timeout to something really small.  Like 10 seconds. That way when
network events take a node out of service, we can detect and respond
much faster.  Also with a separate process we now have the ability to
react instantly to crashes without waiting for a timeout. A
disadvantage is more moving parts, but we can probably abstract this
away cleanly.
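
To make the tradeoff concrete, here is a rough model of how long the
master waits before it hears about a failure under this proposal. It is a
sketch under simplifying assumptions: it ignores ZK and network latency,
the failure names are invented for illustration, and the 60-second figure
is only an example of a larger timeout to compare against.

```python
def detection_delay(failure, session_timeout_s):
    """Rough upper bound, in seconds, before the master is notified."""
    if failure in ("crash", "killed_after_gc_pause"):
        # The watcher sees the child exit and removes the node itself.
        return 0.0
    if failure == "network_outage":
        # Nothing on the node can reach ZK, so only session expiry helps.
        return float(session_timeout_s)
    raise ValueError("unknown failure type: " + failure)


for timeout in (60, 10):
    print(timeout, detection_delay("crash", timeout),
          detection_delay("network_outage", timeout))
```

In this model the timeout only matters for the network case, which is why
it can be tuned down without affecting how crashes are handled.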

One last thought - if we have a 10 second timeout and we have a
network partition, we will see a cascade of failed regionservers.
Considering that the individual RSs may not be able to proceed anyway
(they might have been cut off from too many datanodes to log or read
hfiles), this might be inevitable.  Obviously this means running HBase
across a WAN is right out (we always knew that, right?), but this is
why we are doing replication.
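
A toy illustration of the cascade, assuming each regionserver is simply
cut off from ZK for some number of seconds and its session expires once
that outage reaches the timeout. The server names and durations are made
up.

```python
def expired_servers(outage_seconds, session_timeout_s):
    """Names of servers whose outage lasted at least the session timeout."""
    return sorted(name for name, secs in outage_seconds.items()
                  if secs >= session_timeout_s)


outages = {"rs1": 12, "rs2": 15, "rs3": 4}  # seconds cut off from ZK
print(expired_servers(outages, 10))  # ['rs1', 'rs2']
print(expired_servers(outages, 60))  # []
```

With the short timeout, a partition that a longer timeout would have
ridden out takes down every server it touched for more than 10 seconds.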

On Wed, Mar 17, 2010 at 10:55 AM, Todd Lipcon <t...@cloudera.com> wrote:
> On Wed, Mar 17, 2010 at 10:48 AM, Ryan Rawson <ryano...@gmail.com> wrote:
>
>> I have a 4th option :-)  I'm on the bus right now and I'll write it up
>> when I get to work. In short: move the zk thread out of the rs into a
>> monitoring parent, and then you can explicitly monitor for Juliet gc
>> pauses. More to come....
>>
>
> I don't think that will be correct - it might be mostly correct, but "Juliet
> gc pauses" are just an extra long version of what happens all the time. ZK
> is asynchronous, so we will never find out immediately if we've been killed.
> There can always be an arbitrarily long pause in between looking at ZK state
> and taking an action.
>
> -Todd
>
>
>>
>> On Mar 17, 2010 10:22 AM, "Karthik Ranganathan" <kranganat...@facebook.com> wrote:
>>
>> Loved the "Juliet" terminology as well :).
>>
>> @Todd: I agree we will need something like #2 or especially #3 in other
>> places.
>>
>> Looks like we have a consensus - I will update the JIRA.
>>
>>
>> Thanks
>> Karthik
>>
>>
>> -----Original Message-----
>> From: Todd Lipcon [mailto:t...@cloudera.com]
>>
>> Sent: Tuesday, March 16, 2010 10:09 PM
>> To: hbase-dev@hadoop.apache.org
>> Subject: Re: HBASE-2312 discu...
>>
>> On Tue, Mar 16, 2010 at 8:59 PM, Stack <st...@duboce.net> wrote:
>>
>> > On Tue, Mar 16, 2010 at 5:08 PM,...
>>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
