[ 
https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994394#comment-12994394
 ] 

Jonathan Gray commented on HBASE-1502:
--------------------------------------

Got it.  That sounds like a good start, and removing the heartbeat/HMsg is far 
more critical than the startup RPC.  I remember an issue in the past where it 
seemed odd that discovery went through RPC but failure detection went through 
ZK, but I don't recall exactly what it was now.

+1 on ZK content being JSON serialized.

Just to bring it up since it's loosely related to this stuff, I'm of the 
opinion that a second monitor process is eventually going to be necessary.  ZK 
timeouts are just too high and there are many cases where if we could take GC 
pauses out of the equation, we could have much faster failure detection.  A 
second process that would not have any GC issues could have an ephemeral node 
with a much smaller timeout, or it could monitor the RS process and system 
itself.  I don't see another way towards reducing failure detection times 
without triggering false-positives when the RS is in a GC pause (a 
"recoverable" fault).
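
To make the "monitor the RS process" option concrete, here is a minimal sketch 
of a watchdog that polls the OS for a process's liveness.  Everything in it is 
an assumption for illustration: the class name ProcessWatchdog, the poll 
interval and the onDeath callback are hypothetical, it needs Java 9+ for 
ProcessHandle, and it is not existing HBase code.  The point is that a process 
stuck in a GC pause still shows up as alive to the OS, so this check only 
fires on a real crash.

```java
// Hypothetical sketch of an external watchdog; not part of HBase.
final class ProcessWatchdog {
    private final long pid;          // pid of the monitored process (e.g. the RS)
    private final long pollMillis;   // how often to ask the OS about the process
    private final Runnable onDeath;  // assumed hook, e.g. delete the RS's ZK node

    ProcessWatchdog(long pid, long pollMillis, Runnable onDeath) {
        this.pid = pid;
        this.pollMillis = pollMillis;
        this.onDeath = onDeath;
    }

    /** True if the OS still reports the process as alive. */
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Blocks until the monitored process is gone, then runs the callback.
     * A JVM paused in GC is still alive to the OS, so it does not trigger this.
     */
    void watch() throws InterruptedException {
        while (isAlive(pid)) {
            Thread.sleep(pollMillis);
        }
        onDeath.run();
    }
}
```

With a poll interval of, say, 100 ms the watchdog would notice a crashed 
process well inside a typical ZK session timeout, but it cannot see a machine 
or network failure, so it would complement the ZK ephemeral node rather than 
replace it.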

On an even more unrelated note, we could have some kind of metric (or this 
other process could figure it out) for how often GC pauses are happening and 
for how long, measured either through a looping sleep() thread or an RPC to 
the process, and use that as an additional balancing metric.  Or, once it 
passes a threshold, we could shed the regions off of that RS (actually 
flushing them rather than needing log replay) and then restart the RS process.
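
The looping sleep() idea can be sketched as follows.  This is only an 
illustration under assumptions: the class name PauseDetector, the threshold 
parameters and the shouldShedRegions() check are all made up here and are not 
existing HBase code.  The thread sleeps for a fixed interval and treats any 
large overshoot as a pause; note that it cannot tell a GC pause from any 
other stall (swapping, CPU starvation), which may be fine for a balancing 
metric.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a sleep-loop pause detector; not part of HBase.
final class PauseDetector implements Runnable {
    private final long sleepMillis;      // requested sleep per iteration
    private final long thresholdMillis;  // overshoot that counts as a pause
    private final AtomicLong pauseCount = new AtomicLong();
    private final AtomicLong totalPauseMillis = new AtomicLong();

    PauseDetector(long sleepMillis, long thresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.thresholdMillis = thresholdMillis;
    }

    /** Time slept beyond what was requested, in ms (never negative). */
    static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = (endNanos - startNanos) / 1_000_000L;
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    /** Records one observation; returns true if it was counted as a pause. */
    boolean record(long overshootMillis) {
        if (overshootMillis < thresholdMillis) {
            return false;
        }
        pauseCount.incrementAndGet();
        totalPauseMillis.addAndGet(overshootMillis);
        return true;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the flag and stop
                return;
            }
            record(overshootMillis(start, System.nanoTime(), sleepMillis));
        }
    }

    long getPauseCount() { return pauseCount.get(); }

    long getTotalPauseMillis() { return totalPauseMillis.get(); }

    /** Assumed policy: shed regions once accumulated pause time passes a limit. */
    boolean shouldShedRegions(long maxTotalPauseMillis) {
        return totalPauseMillis.get() > maxTotalPauseMillis;
    }
}
```

Run as a daemon thread inside the RS, the counters could be exported as 
metrics for the balancer; the same overshoot calculation would work from the 
second process if it timed an RPC to the RS instead of a local sleep.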

> Remove need for heartbeats in HBase
> -----------------------------------
>
>                 Key: HBASE-1502
>                 URL: https://issues.apache.org/jira/browse/HBASE-1502
>             Project: HBase
>          Issue Type: Task
>            Reporter: Nitay Joffe
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: 1502-v2.txt, 1502.txt
>
>
> HBase currently uses heartbeats between region servers and the master, 
> piggybacking information on them when it can. This issue is to investigate if 
> we can get rid of the need for those using ZooKeeper events.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
