[ https://issues.apache.org/jira/browse/HBASE-678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629069#action_12629069 ]
Billy Pearson commented on HBASE-678:
-------------------------------------
I think we should have a multi-stage/process safe mode on the master, not just
for the clients but also to handle crash recovery of regions and region
balancing while we are in safe mode.
We have an issue for helping the balancing out, HBASE-862, but I think it will
still be helpful to include all start-up balancing while in safe mode.
This assumes we do not run, but just queue up, any needed compaction/split
checks when loading regions while in safe mode.
Stage 1: Deploy all regions
Stage 2: Do any crash recovery needed and do a flush to get that to disk
(remove recovery logs on successful flush)
Stage 3: Do any balancing of the regions before exiting the safe mode if needed.
Stage 3 is there so that no compactions or splits are running on the
regions and we can move them around as needed to balance the region count
out.
If there are no compactions running, closes happen immediately.
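
To make the three stages concrete, here is a minimal sketch in Python. It is
a toy model only, not HBase's actual master code: the class SafeModeMaster,
its methods, and the server/region names are all hypothetical, and "balanced"
is assumed to mean that region counts differ by at most one between servers.

```python
from collections import deque
from enum import Enum


class Stage(Enum):
    DEPLOY = 1   # Stage 1: deploy all regions
    RECOVER = 2  # Stage 2: crash recovery, then flush
    BALANCE = 3  # Stage 3: balance region counts
    OPEN = 4     # safe mode exited; clients may connect


class SafeModeMaster:
    """Toy model of the staged safe mode (hypothetical, not HBase code)."""

    def __init__(self, assignments, needs_recovery=()):
        # assignments: {server_name: [region_name, ...]}
        self.assignments = {s: list(r) for s, r in assignments.items()}
        self.needs_recovery = set(needs_recovery)
        self.recovery_logs = set(needs_recovery)
        self.pending_checks = deque()  # queued compaction/split checks
        self.stage = Stage.DEPLOY

    def in_safe_mode(self):
        return self.stage is not Stage.OPEN

    def deploy(self):
        # Stage 1: while in safe mode, compaction/split checks are only
        # queued for each loaded region, never run.
        for regions in self.assignments.values():
            self.pending_checks.extend(regions)
        self.stage = Stage.RECOVER

    def recover(self):
        # Stage 2: replay and flush each crashed region (not modeled here),
        # then drop its recovery log, standing in for a successful flush.
        for region in sorted(self.needs_recovery):
            self.recovery_logs.discard(region)
        self.stage = Stage.BALANCE

    def balance(self):
        # Stage 3: no compactions are running, so each move is immediate.
        moves = []
        while self.assignments:
            busiest = max(self.assignments, key=lambda s: len(self.assignments[s]))
            idlest = min(self.assignments, key=lambda s: len(self.assignments[s]))
            if len(self.assignments[busiest]) - len(self.assignments[idlest]) <= 1:
                break
            region = self.assignments[busiest].pop()
            self.assignments[idlest].append(region)
            moves.append((region, busiest, idlest))
        self.stage = Stage.OPEN
        return moves

    def start(self):
        self.deploy()
        self.recover()
        return self.balance()


master = SafeModeMaster(
    {"rs1": ["a", "b", "c", "d", "e"], "rs2": ["f"], "rs3": []},
    needs_recovery=["b"],
)
moves = master.start()
print(moves, master.in_safe_mode())
```

The point of the ordering is that balancing runs only after recovery and
before any queued compaction or split check is released, so no close ever has
to wait on a running compaction.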
I have seen some rebalancing happen on start-up, and the region servers go
crazy trying to balance, as Daniel commented above.
In my cluster this mostly comes from closing regions having to wait for
running compactions, which creates a lag in the balancing counts.
When the compactions finish and the regions get closed and redeployed, the
counts are all out of balance again, and the same thing happens over and over
until almost all the compactions are done
and the regions can close and redeploy without the compaction lag.
Once we have done the above, everything will be ready for the clients to
connect to the cluster without having to worry about churn from balancing or
crash-recovering regions.
Daniel: If we block region balancing while in safe mode, your clients can
connect when we come out of safe mode, but then balancing will kick in and you
will see the same churn as we have now.
> hbase needs a 'safe-mode'
> -------------------------
>
> Key: HBASE-678
> URL: https://issues.apache.org/jira/browse/HBASE-678
> Project: Hadoop HBase
> Issue Type: Improvement
> Reporter: stack
> Assignee: Jim Kellerman
> Priority: Critical
> Fix For: 0.19.0
>
>
> Internally we have a cluster of thousands of regions. We just did an hbase
> restart w/ master on new node. Just so happened that one of the
> regionservers was running extra slow (was loaded down by other processes).
> Meant that its portion of the assignments was taking a long time to come up...
> While these regions were stuck in deploy mode, the cluster was not usable.
> We need a sort of 'safe-mode' in hbase where clients fail if they try to
> attach to a cluster not yet fully up. UI should show when all assignments
> have been successfully made so admin can at least see when they have a
> problematic regionserver in their midst.