[ https://issues.apache.org/jira/browse/HBASE-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994394#comment-12994394 ]
Jonathan Gray commented on HBASE-1502:
--------------------------------------
Got it. That sounds like a good start, and removing the heartbeat/HMsg is far
more critical than the startup RPC. I remember some issue in the past where it
was odd that discovery was RPC but failure was ZK... but I don't recall exactly
what it was now.
+1 on ZK content being JSON serialized.
Just to bring it up since it's loosely related to this stuff, I'm of the
opinion that a second monitor process is eventually going to be necessary. ZK
timeouts are just too high and there are many cases where if we could take GC
pauses out of the equation, we could have much faster failure detection. A
second process that would not have any GC issues could have an ephemeral node
with a much smaller timeout, or it could monitor the RS process and system
itself. I don't see another way towards reducing failure detection times
without triggering false-positives when the RS is in a GC pause (a
"recoverable" fault).
On an even more unrelated note, we could have some kind of metric for how often
GC pauses are happening and for how long (measured either through a looping
sleep() thread or an RPC to the process, or this other process could figure it
out) and use that as an additional balancing metric. Or we could have it so
that once an RS passes a threshold, we shed the regions off of that RS
(actually flushing instead of needing replay), and then restart the RS
process.
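To make the looping sleep() idea concrete, here is a rough sketch and not
anything that exists in HBase: the class name PauseProbe, its methods, and the
500 ms / 1000 ms values are all made up for illustration. A thread sleeps for a
fixed interval and treats any large overshoot in wall-clock time as a pause.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a sleep-loop pause detector; not an HBase class. */
public class PauseProbe {
    private final long sleepMillis;      // how long each probe sleeps
    private final long thresholdMillis;  // overshoot that counts as a pause
    private long pauseCount;
    private long totalPauseMillis;

    public PauseProbe(long sleepMillis, long thresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.thresholdMillis = thresholdMillis;
    }

    /** Time spent beyond the requested sleep; never negative. */
    public static long excessMillis(long requestedMillis, long elapsedMillis) {
        return Math.max(0L, elapsedMillis - requestedMillis);
    }

    /** Records one observation; returns true if it counted as a pause. */
    public boolean record(long elapsedMillis) {
        long excess = excessMillis(sleepMillis, elapsedMillis);
        if (excess < thresholdMillis) {
            return false;
        }
        pauseCount++;
        totalPauseMillis += excess;
        return true;
    }

    /** One probe: sleep, then compare elapsed time with the requested sleep. */
    public boolean probeOnce() throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMillis);
        long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return record(elapsed);
    }

    public long getPauseCount() {
        return pauseCount;
    }

    public long getTotalPauseMillis() {
        return totalPauseMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed values: probe every 500 ms, count overshoots of 1 s or more.
        PauseProbe probe = new PauseProbe(500, 1000);
        for (int i = 0; i < 10; i++) {
            probe.probeOnce();
        }
        System.out.println("pauses=" + probe.getPauseCount()
            + " totalMillis=" + probe.getTotalPauseMillis());
    }
}
```

Note that an overshoot like this shows up for any stall (GC, swapping, CPU
starvation), so it measures "the process was not running" rather than GC
specifically. The count and total could then feed the balancing metric or the
shed-and-restart threshold mentioned above.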
> Remove need for heartbeats in HBase
> -----------------------------------
>
> Key: HBASE-1502
> URL: https://issues.apache.org/jira/browse/HBASE-1502
> Project: HBase
> Issue Type: Task
> Reporter: Nitay Joffe
> Assignee: stack
> Priority: Blocker
> Fix For: 0.92.0
>
> Attachments: 1502-v2.txt, 1502.txt
>
>
> HBase currently uses heartbeats between region servers and the master,
> piggybacking information on them when it can. This issue is to investigate if
> we can get rid of the need for those using ZooKeeper events.