[ 
https://issues.apache.org/jira/browse/HBASE-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882262#action_12882262
 ] 

Jonathan Gray commented on HBASE-2700:
--------------------------------------

In what situation does the data in ZK not reflect the actual state?  In order 
for a RS to, for example, open a region, it must transition a node in ZK from 
nothing, to OPENING, to OPENED; if any transition fails, the region does not 
open.  It seems to me that it is META which may not be up to date, and META 
which can change without the proper notifications being sent.
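
To make the transition rule concrete, here is a minimal sketch.  FakeZK is a 
hypothetical in-memory stand-in, not the real ZooKeeper client API, and the 
/unassigned/ path and the open_region helper are assumptions for illustration 
only.  It shows that a region only counts as open if every version-checked 
transition succeeds.

```python
class FakeZK:
    """Hypothetical in-memory stand-in for ZooKeeper (not the real client API)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, version)

    def create(self, path, data):
        # Fails if the node already exists, like a non-idempotent create.
        if path in self.nodes:
            return False
        self.nodes[path] = (data, 0)
        return True

    def get(self, path):
        return self.nodes.get(path)

    def set_if_version(self, path, data, expected_version):
        # Compare-and-set: only succeeds if nobody changed the node meanwhile.
        current = self.nodes.get(path)
        if current is None or current[1] != expected_version:
            return False
        self.nodes[path] = (data, expected_version + 1)
        return True


def open_region(zk, region):
    """Return True only if nothing -> OPENING -> OPENED all succeeded."""
    path = "/unassigned/" + region  # assumed path layout
    if not zk.create(path, "OPENING"):  # nothing -> OPENING
        return False
    # ... the actual work of opening the region would happen here ...
    _, version = zk.get(path)
    return zk.set_if_version(path, "OPENED", version)  # OPENING -> OPENED
```

A second attempt to open the same region fails at the create step, so two 
servers cannot both believe they opened it.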

In the style where we ask each RS what it hosts and match that up against 
META, we must then do all edits of META on the master side.  Otherwise there 
will always be race conditions between what the master thinks is the state 
(via meta scan) and what the actual state is (via RSs setting stuff in meta).  
ZK allows us to ensure we never miss states and transitions.

As for the second list of RSs kept up in ZK, we could get this data from META, 
but what about the case where a RS died while a region was being assigned to 
it, before it finished opening?  Whether this is a problem or not depends very 
much on who is the one who edits meta, whether we rely on meta to determine 
that something is not assigned, etc.

There has been consideration of how this is handled in the BT paper, but I am 
of the mindset that explicit, persistent message passing via ZK is a better 
direction than meta scanning / per-RS check-in / heartbeating.  What happens 
if we have 1000 RSs and 1M regions?  That's a significant amount of work to 
do.  What if a single RS happens to be in a 10 second GC pause?  What about 
race conditions between what is in META and what the RSs know about?  What if 
we see in META that something is unassigned but the previous master asked a RS 
to open it?  That region is in "opening" state on the RS but is not yet 
assigned, so would it come back in that server's list of assigned regions?  
This is super explicit via transitions in ZK.
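
As a rough sketch of what "explicit" could look like on failover: a new master 
reads the in-transition nodes and sorts them by state and by whether the 
owning RS is still alive.  The function name, the input shapes, and the three 
resulting actions are all assumptions for illustration, not an existing HBase 
API or the agreed design.

```python
def recover_in_transition(zk_nodes, live_servers):
    """Classify regions in transition after a master failover.

    zk_nodes: dict of region -> (state, server), as read from ZK (assumed shape).
    live_servers: set of RS names that are currently alive.
    Returns (wait_for, reassign, finalize) lists of region names.
    """
    wait_for, reassign, finalize = [], [], []
    for region, (state, server) in sorted(zk_nodes.items()):
        if server not in live_servers:
            # The RS died mid-assignment; the region must be assigned elsewhere.
            reassign.append(region)
        elif state == "OPENING":
            # The RS is alive and still opening; wait for the OPENED transition.
            wait_for.append(region)
        elif state == "OPENED":
            # The open completed; the master only needs to finish bookkeeping.
            finalize.append(region)
    return wait_for, reassign, finalize
```

The point is that no RS has to be polled: everything the new master needs is 
in the persisted transition state plus the set of live servers.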

As for holding everything in memory, I think we can punt on this for a while.  
The only thing pertinent to this discussion is that if holding it all in 
memory is possibly untenable, doesn't that mean it is also untenable to do 
master failover in this style (holding every RS and its regions after asking 
it via RPC, and holding the META view of every region and the RS it is 
assigned to)?

> Handle master failover for regions in transition
> ------------------------------------------------
>
>                 Key: HBASE-2700
>                 URL: https://issues.apache.org/jira/browse/HBASE-2700
>             Project: HBase
>          Issue Type: Sub-task
>          Components: master, zookeeper
>            Reporter: Jonathan Gray
>            Priority: Critical
>             Fix For: 0.21.0
>
>
> To this point in HBASE-2692 tasks we have moved everything for regions in 
> transition into ZK, but we have not fully handled the master failover case.  
> This is to deal with that and to write tests for it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
