[ 
https://issues.apache.org/jira/browse/HBASE-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847704#action_12847704
 ] 

stack commented on HBASE-2342:
------------------------------

I like the ideas in here, in particular the interception of GC start/stop and 
proactive rolling restart.

I agree that catching zk events in the babysitter then relaying them to the RS 
would be a horrorshow, especially where we are talking of all RS transitions 
going via zk.

I wrote to the hotspot gc list to ask whether there is a way of getting 
forewarning that a full serial GC is imminent, or whether we can trigger it at 
a preordained time.  Let's see if I get anything back other than "consult with 
your vendor".

What about the idea that was discussed a good while back of a lightweight zk 
heartbeater that lives outside of the RS, done up in native code?  It'd be 
immune to GC pauses.  It would keep up the session with zk (only).  If it lost 
its zk session, it'd kill the RS (perhaps its parent process).  The RS, with 
its heavy-weight zk connection, would just renew its session once it revived 
from a pause, or have timeouts that are hard to expire, and carry on from where 
it left off.
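
A rough sketch of the heartbeater's control loop, to make the idea concrete. 
It is written in Python only for brevity; the proposal is for native code. 
Every name below (SessionWatchdog, send_heartbeat, kill_rs) is hypothetical, 
and the zk side is stubbed out as a callable rather than a real client:

```python
import time
from typing import Callable


class SessionWatchdog:
    """Keeps a session alive; kills the guarded process if the session is lost.

    Hypothetical sketch: `send_heartbeat` stands in for whatever a native zk
    client would do to ping the ensemble, and `kill_rs` for killing the RS.
    """

    def __init__(self, send_heartbeat: Callable[[], bool],
                 kill_rs: Callable[[], None],
                 session_timeout: float,
                 clock: Callable[[], float] = time.monotonic):
        self.send_heartbeat = send_heartbeat
        self.kill_rs = kill_rs
        self.session_timeout = session_timeout
        self.clock = clock
        self.last_ok = clock()   # time of the last successful heartbeat
        self.killed = False

    def tick(self) -> bool:
        """Run one heartbeat attempt; return True while the session is alive."""
        if self.killed:
            return False
        now = self.clock()
        # No successful heartbeat for a full timeout: presume the session
        # has expired and kill the RS so it cannot come back to life.
        if now - self.last_ok >= self.session_timeout:
            self.killed = True
            self.kill_rs()
            return False
        if self.send_heartbeat():
            self.last_ok = now
        return True
```

The caller would invoke tick() on a short fixed interval, well under the 
session timeout.  Server-side expiry is not modelled here; the sketch only 
shows the "lose the session, kill the RS" rule.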

While the above might run into long Juliet pauses, we'd log them.  GC tuning 
would help mitigate at least some of them (though it seems the occasional full 
serial GC is unavoidable).
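
One common way to log such pauses is to have a thread sleep for a fixed 
interval and measure how far it overshoots.  The sketch below illustrates that 
technique under assumed names and thresholds (detect_pause, monitor, 0.5s 
sleep, 1s warning threshold); in practice the loop would have to run inside 
the process that pauses, i.e. as a thread in the RS JVM:

```python
import logging
import time

log = logging.getLogger("pause-monitor")


def detect_pause(expected_sleep, observed_elapsed, warn_threshold):
    """Return the seconds slept beyond the request, or 0.0 if under threshold."""
    extra = observed_elapsed - expected_sleep
    return extra if extra >= warn_threshold else 0.0


def monitor(iterations, sleep_interval=0.5, warn_threshold=1.0,
            sleep=time.sleep, clock=time.monotonic):
    """Sleep repeatedly and log any wake-up that is late by warn_threshold."""
    pauses = []
    for _ in range(iterations):
        start = clock()
        sleep(sleep_interval)
        extra = detect_pause(sleep_interval, clock() - start, warn_threshold)
        if extra:
            # The whole process was stalled, e.g. by a full GC or the host.
            log.warning("stalled for roughly %.1fs (GC or host pause?)", extra)
            pauses.append(extra)
    return pauses
```

The overshoot cannot say why the process stalled, only that it did, so the log 
line would need to be correlated with the GC log.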


> Consider adding a watchdog node next to region server
> -----------------------------------------------------
>
>                 Key: HBASE-2342
>                 URL: https://issues.apache.org/jira/browse/HBASE-2342
>             Project: Hadoop HBase
>          Issue Type: New Feature
>          Components: regionserver
>            Reporter: Todd Lipcon
>
> This idea has been bandied about a fair amount. The concept is to add a 
> second java process that runs next to each region server to act as a 
> watchdog. Several possible purposes:
> - monitor the RS for liveness - if it exhibits Juliet syndrome ("appears 
> dead") then we kill it aggressively to prevent it from coming back to life
> - restart RS automatically in failure cases
> - potentially move the entire ZK session to the watchdog to decouple node 
> liveness from the particular JVM liveness
> Let's discuss in this JIRA.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
