[ 
https://issues.apache.org/jira/browse/HBASE-19501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289866#comment-16289866
 ] 

stack commented on HBASE-19501:
-------------------------------

.001 is HBASE-19501 and HBASE-18946 squashed together so I can get a hadoopqa 
run in. Will describe the content in the next post.

> [AMv2] Retain assignment across restarts
> ----------------------------------------
>
>                 Key: HBASE-19501
>                 URL: https://issues.apache.org/jira/browse/HBASE-19501
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Region Assignment
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0-beta-1
>
>         Attachments: HBASE-19501.master.001.patch, HBASE-19501.patch
>
>
> Working with replicas and the parent test in particular, I learned a few 
> interesting things:
>  # It is hard to test whether we retain assignments because our little 
> minicluster gives RegionServers new ports on restart, foiling our means of 
> recognizing a new instance of a server by checking hostname+port (and 
> ensuring the startcode is larger).
>  # Some of our tests like the parent test depended on retaining assignment 
> across restarts.
>  # As said in parent issue, master used to be last to go down when we did a 
> controlled cluster shutdown. We lost that when we moved to AMv2.
>  # When we do a cluster shutdown, the RegionServers close down the Regions, 
> not the Master as is usual in AMv2 (Master wants to do all assign ops in 
> AMv2). This means that the Master is surprised when it gets notification of 
> CLOSE ops that it did not initiate. Usually on CLOSE, Master updates meta 
> with the CLOSE state. On cluster shutdown we are not doing this.
>  # So, on restart, we read meta and we see all regions still in OPEN state, 
> so we think the cluster crashed and we go and do ServerCrashProcedure, which 
> hoses our ability to retain assignment.
> Some experiments:
>  # I can make the Master stay up so it is last to go down.
>  # This stops us spewing the logs with failed transition messages; those 
> came from the Master not being up to receive the CLOSE transitions.
>  # I hacked in a means of telling the minicluster which ports it should use 
> on start; this helps fake the case of new RS instances.
>  # It is hard to tell the difference between a clean shutdown and a crash. 
> It is dangerous if we get the call wrong. Currently we just let 
> ServerCrashProcedure deal with it -- the safest option -- so one experiment 
> is that when it goes to assign the regions that were on the crashed server, 
> rather than round robin, it looks to see whether there is a new instance of 
> the old location and, if so, just gives it all the regions. That'd retain 
> locality. This seems to work. Problem is that SCP is doing the assignment. 
> Ideally the balancer would do it.
> Let me put up a patch that retains assignment across restart (somehow).
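
For illustration only, here is a minimal Java sketch of the retain idea in the
last experiment above: when assigning the regions of a crashed server, prefer
a new instance of the old location (same hostname+port, larger startcode) and
fall back to round robin otherwise. The class Server and the method plan are
hypothetical names made up for this sketch; this is not the code in the
attached patch, nor the real ServerName or ServerCrashProcedure classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RetainAssignmentSketch {

  /** Hypothetical stand-in for a server identity: hostname, port, startcode. */
  static final class Server {
    final String host;
    final int port;
    final long startcode;

    Server(String host, int port, long startcode) {
      this.host = host;
      this.port = port;
      this.startcode = startcode;
    }

    /** True if this is a restarted instance of {@code old}: same host+port, larger startcode. */
    boolean isNewInstanceOf(Server old) {
      return host.equals(old.host) && port == old.port && startcode > old.startcode;
    }

    @Override
    public String toString() {
      return host + "," + port + "," + startcode;
    }
  }

  /**
   * Plan targets for the regions of a crashed server (assumed logic, not SCP's).
   * If a new instance of the old location is online, it gets all the regions;
   * otherwise regions are spread round robin over the online servers.
   */
  static Map<String, Server> plan(Server crashed, List<String> regions, List<Server> online) {
    Map<String, Server> plan = new LinkedHashMap<>();
    if (online.isEmpty()) {
      return plan; // nowhere to assign
    }
    Server successor = null;
    for (Server s : online) {
      if (s.isNewInstanceOf(crashed)) {
        successor = s;
        break;
      }
    }
    for (int i = 0; i < regions.size(); i++) {
      Server target = successor != null ? successor : online.get(i % online.size());
      plan.put(regions.get(i), target);
    }
    return plan;
  }

  public static void main(String[] args) {
    Server old = new Server("rs1.example.org", 16020, 100L);
    List<Server> online = Arrays.asList(
        new Server("rs2.example.org", 16020, 150L),
        new Server("rs1.example.org", 16020, 200L)); // restarted on the same port
    System.out.println(plan(old, Arrays.asList("r1", "r2", "r3"), online));
  }
}
```

If the restarted server comes back on a different port, as the minicluster
does by default, isNewInstanceOf returns false and the sketch falls back to
round robin, which is why fixing the ports matters for testing this.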



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
