[ 
https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629069#action_12629069
 ] 

Billy Pearson commented on HBASE-678:
-------------------------------------

I think we should have a multi-stage safe mode on the master, not just 
for the clients but also to handle crash recovery of regions and region 
balancing while we are in safe mode.
We have an issue for helping the balancing out in HBASE-862, but I think it 
will still be helpful to include all startup balancing while in safe mode.

Assuming we do not run, but just queue up, any needed compaction/split checks 
when loading regions while in safe mode:
Stage 1: Deploy all regions.
Stage 2: Do any crash recovery needed and do a flush to get that to disk 
(remove recovery logs on a successful flush).
Stage 3: Do any balancing of the regions, if needed, before exiting safe mode.
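
To make the staging concrete, here is a minimal sketch of how the master 
could track the stages and queue up compaction/split checks until safe mode 
ends. All names in it (StagedSafeMode, Stage, requestCheck, and so on) are 
made up for illustration and are not existing HBase classes or methods; it 
only shows the ordering, not the real deploy, recovery or balancing work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch of a staged safe mode; none of these names are HBase APIs. */
class StagedSafeMode {
    /** Stages in the order proposed above; OPEN means safe mode has been exited. */
    enum Stage { DEPLOY_REGIONS, CRASH_RECOVERY, BALANCE, OPEN }

    private Stage stage = Stage.DEPLOY_REGIONS;
    // Compaction/split checks requested while in safe mode are queued, not run.
    private final Queue<String> deferredChecks = new ArrayDeque<>();

    Stage stage() { return stage; }

    boolean inSafeMode() { return stage != Stage.OPEN; }

    /** Clients are only admitted once every stage has completed. */
    boolean clientsAllowed() { return !inSafeMode(); }

    /** Returns true if the check may run now, false if it was queued for later. */
    boolean requestCheck(String regionName) {
        if (inSafeMode()) {
            deferredChecks.add(regionName);
            return false;
        }
        return true;
    }

    /** Marks the current stage complete and moves on to the next one. */
    void completeStage() {
        switch (stage) {
            case DEPLOY_REGIONS: stage = Stage.CRASH_RECOVERY; break;
            case CRASH_RECOVERY: stage = Stage.BALANCE; break;
            case BALANCE:        stage = Stage.OPEN; break;
            default: throw new IllegalStateException("already out of safe mode");
        }
    }

    /** Hands back the queued checks once safe mode is over, and clears the queue. */
    List<String> drainDeferredChecks() {
        if (inSafeMode()) {
            throw new IllegalStateException("still in safe mode");
        }
        List<String> out = new ArrayList<>(deferredChecks);
        deferredChecks.clear();
        return out;
    }
}
```

The point of the sketch is that the balancing stage is only reached after 
recovery is done, and nothing queued by requestCheck runs until the last 
stage has completed.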

Stage 3 is there so that no compactions or splits are running on the 
regions and we can move them around as needed to balance the region count 
out. 
If there are no compactions running, closes happen immediately.

I have seen some rebalancing happen on startup, and the region servers go 
crazy trying to balance, as Daniel commented above. 
In my cluster this is mostly caused by closing regions having to wait for 
running compactions, which creates a lag in the balancing counts.
When the compactions finish and the regions get closed and redeployed, the 
counts are all out of balance again, and the same thing happens over and over 
until almost all the compactions are done 
and the regions can close and redeploy without the compaction lag.
Once we have done the above, everything will be ready for the clients to 
connect to the cluster without having to worry about churn from balancing or 
crash recovery of regions.
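
The churn described above can be shown with a toy model. This is only a 
sketch under my own assumptions, not how the master's balancer is written: 
two region servers, a balancer that each round asks the fuller one to give up 
half the difference in region counts, and a fixed lag (standing in for a close 
waiting on a compaction) before a move shows up in the counts. The class name 
BalancerLagToy and everything in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of balancing on stale counts; not based on real HBase balancer code. */
class BalancerLagToy {
    /**
     * Returns how many region moves the balancer requests before the two servers
     * hold equal counts with nothing in flight, or until maxRounds have passed.
     * A requested move only changes the counts lagRounds rounds after it is issued.
     */
    static int movesIssued(int a, int b, int lagRounds, int maxRounds) {
        // Signed moves waiting to take effect: positive means a -> b.
        Deque<Integer> inFlight = new ArrayDeque<>();
        for (int i = 0; i < lagRounds; i++) {
            inFlight.addLast(0);
        }
        int total = 0;
        for (int round = 0; round < maxRounds; round++) {
            // The balancer sees only the applied counts, not the moves in flight.
            int move = (a - b) / 2;
            inFlight.addLast(move);
            total += Math.abs(move);

            // The oldest request finally closes and redeploys its regions.
            int applied = inFlight.removeFirst();
            if (applied > 0) {
                applied = Math.min(applied, a);    // cannot move more than a holds
            } else {
                applied = -Math.min(-applied, b);  // cannot move more than b holds
            }
            a -= applied;
            b += applied;

            boolean idle = inFlight.stream().allMatch(m -> m == 0);
            if (a == b && idle) {
                break;
            }
        }
        return total;
    }
}
```

With no lag, movesIssued(10, 0, 0, 50) requests the 5 moves that are needed 
and stops. With a lag of 2 rounds the balancer keeps issuing requests against 
counts that are already out of date, which is the same over-and-over pattern 
as above.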

Daniel: If we block region balancing while in safe mode, your clients can 
connect when we come out of safe mode, but then balancing will kick in and 
you will see the same churn as we have now.

> hbase needs a 'safe-mode'
> -------------------------
>
>                 Key: HBASE-678
>                 URL: https://issues.apache.org/jira/browse/HBASE-678
>             Project: Hadoop HBase
>          Issue Type: Improvement
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Critical
>             Fix For: 0.19.0
>
>
> Internally we have a cluster of thousands of regions.  We just did a hbase 
> restart w/ master on new node.  Just so happened that one of the 
> regionservers was running extra slow (was loaded down by other processes).  
> Meant that its portion of the assignments was taking a long time to come up... 
>  While these regions were stuck in deploy mode, the cluster was not usable.
> We need a sort of 'safe-mode' in hbase where clients fail if they try to 
> attach to a cluster not yet fully up.  UI should show when all assignments 
> have been successfully made so admin can at least see when they have a 
> problematic regionserver in their midst.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
