[
https://issues.apache.org/jira/browse/HBASE-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882262#action_12882262
]
Jonathan Gray commented on HBASE-2700:
--------------------------------------
In what situation does the data in ZK not reflect the actual state? In order
for an RS to, for example, open a region, it must transition a node in ZK from
nothing, to OPENING, to OPENED; if this fails, it does not open. It seems to me
that it is META which may not be up to date, and META which can change without
the proper notifications being sent.
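A minimal sketch of the nothing -> OPENING -> OPENED transition described
above, assuming each step is a compare-and-set on a per-region node. The class
and method names are hypothetical, and a ConcurrentHashMap stands in for ZK;
this is not the HBase or ZooKeeper API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for per-region ZK nodes; NOT the HBase or ZooKeeper API. */
class RegionTransitionSketch {
    enum State { OPENING, OPENED }

    // Region name -> current state. A missing key means "no node exists".
    private final Map<String, State> nodes = new ConcurrentHashMap<>();

    /** nothing -> OPENING: succeeds only if no node exists yet. */
    boolean startOpening(String region) {
        return nodes.putIfAbsent(region, State.OPENING) == null;
    }

    /** OPENING -> OPENED: succeeds only if the node is still OPENING. */
    boolean finishOpening(String region) {
        return nodes.replace(region, State.OPENING, State.OPENED);
    }

    /** Hypothetical RS-side open: the region opens only if both transitions succeed. */
    boolean openRegion(String region) {
        if (!startOpening(region)) {
            return false; // a node already exists, so do not open
        }
        // ... the actual work of opening the region would happen here ...
        return finishOpening(region);
    }

    State stateOf(String region) {
        return nodes.get(region);
    }

    public static void main(String[] args) {
        RegionTransitionSketch zk = new RegionTransitionSketch();
        System.out.println(zk.openRegion("r1")); // true
        System.out.println(zk.openRegion("r1")); // false: node already exists
        System.out.println(zk.stateOf("r1"));    // OPENED
    }
}
```

The point of the sketch is only that a failed transition is visible to the
caller, so the RS can refuse to open the region rather than diverge from what
is recorded.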
In the style where we ask each RS what it hosts and match that up against META,
we must then do all edits of META on the master side. Otherwise there will
always be race conditions between what the master thinks the state is (via a
META scan) and what the actual state is (via RSs setting things in META). ZK
lets us ensure we never miss states or transitions.
As for the second list, of RSs up in ZK: we could get this data from META, but
what about the case where an RS died while a region was being assigned to it,
before it finished opening? Whether this is a problem depends very much on who
edits META, whether we rely on META to determine that something is not
assigned, etc.
There has been consideration of how this is handled in the BT paper, but I am
of the mindset that explicit, persistent message passing via ZK is a better
direction than META scanning / per-RS check-in / heartbeating.
What happens if we have 1000 RSs and 1M regions? That's a significant amount of
work to do. What if a single RS happens to be in a 10-second GC pause? What
about race conditions between what is in META and what the RSs know about?
What if we see in META that something is unassigned, but the previous master
asked an RS to open it? That RS is in the "opening" state but the region is not
yet assigned, so would it come back in that server's list of assigned regions?
With transitions in ZK, this is super explicit.
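To illustrate why explicit states help on failover, here is a rough sketch of
how a new master might classify regions from the transition nodes alone,
without polling every RS. The Node layout, the Action names, and the recover
method are assumptions for illustration, not HBase code; in particular it
assumes each node records the RS that last wrote it.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical failover pass over region-in-transition nodes; not HBase code. */
class FailoverScanSketch {
    enum State { OPENING, OPENED }
    enum Action { WAIT_FOR_OPEN, REASSIGN, MARK_ASSIGNED }

    /** Assumed node payload: the state and the RS that last wrote it. */
    static final class Node {
        final State state;
        final String server;

        Node(State state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    /** Decide what the new master should do for each region with a node. */
    static Map<String, Action> recover(Map<String, Node> nodes, Set<String> liveServers) {
        Map<String, Action> plan = new TreeMap<>();
        for (Map.Entry<String, Node> e : nodes.entrySet()) {
            Node n = e.getValue();
            if (!liveServers.contains(n.server)) {
                // The RS died, possibly mid-open: the region needs a new home.
                plan.put(e.getKey(), Action.REASSIGN);
            } else if (n.state == State.OPENING) {
                // A live RS was asked to open it by the previous master.
                plan.put(e.getKey(), Action.WAIT_FOR_OPEN);
            } else {
                plan.put(e.getKey(), Action.MARK_ASSIGNED);
            }
        }
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Node> nodes = new TreeMap<>();
        nodes.put("r1", new Node(State.OPENING, "rs1"));
        nodes.put("r2", new Node(State.OPENING, "rs2")); // rs2 is not live
        nodes.put("r3", new Node(State.OPENED, "rs1"));
        // prints {r1=WAIT_FOR_OPEN, r2=REASSIGN, r3=MARK_ASSIGNED}
        System.out.println(recover(nodes, Set.of("rs1")));
    }
}
```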
As for holding it all in memory, I think we can punt on this for a while. The
only thing pertinent to this discussion is that if holding it all in memory is
possibly untenable, doesn't that mean it's also untenable to do master failover
in this style (holding every RS and its regions after asking it via RPC, plus
the META view of every region and the RS it is assigned to)?
> Handle master failover for regions in transition
> ------------------------------------------------
>
> Key: HBASE-2700
> URL: https://issues.apache.org/jira/browse/HBASE-2700
> Project: HBase
> Issue Type: Sub-task
> Components: master, zookeeper
> Reporter: Jonathan Gray
> Priority: Critical
> Fix For: 0.21.0
>
>
> To this point in HBASE-2692 tasks we have moved everything for regions in
> transition into ZK, but we have not fully handled the master failover case.
> This is to deal with that and to write tests for it.