churro morales created HBASE-14129:
--------------------------------------
Summary: If any regionserver gets shutdown uncleanly during full
cluster restart, locality looks to be lost
Key: HBASE-14129
URL: https://issues.apache.org/jira/browse/HBASE-14129
Project: HBase
Issue Type: Bug
Reporter: churro morales
We were doing a cluster restart the other day. Some regionservers did not shut
down cleanly. Upon restart, our locality went from 99% to 5%. Looking at the
code, AssignmentManager.joinCluster() calls
AssignmentManager.processDeadServersAndRegionsInTransition().
If the failover flag gets set for any reason, it seems we don't call
assignAllUserRegions(). It then looks like the balancer does the work of
assigning those regions. We don't use a locality-aware balancer, so we lost our
region locality.
I don't have a solid grasp of the reasoning behind these checks, but there are
some potential workarounds:
1. After shutting down your cluster, move your WALs aside (replay later).
2. Clean up your znodes.
That seems to work, but requires a lot of manual labor. Another solution, which
I prefer, would be to add a flag such as ./start-hbase.sh --clean.
If we start the master with that flag, then
AssignmentManager.processDeadServersAndRegionsInTransition() checks it and, if
it is set, calls assignAllUserRegions() regardless of the failover state.
I have a patch for the latter solution, assuming I am understanding the logic
correctly.
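
To make the proposed check concrete, here is a minimal sketch of the decision as
described above. It is not HBase source code and not the patch itself: the class
CleanStartSketch, the Action enum, the decide() method and the cleanStart
parameter are all hypothetical names used only for illustration, and the real
logic in processDeadServersAndRegionsInTransition() is more involved.

```java
// Simplified model of the startup decision discussed in this issue.
// This is NOT HBase source code; every name below is hypothetical.
class CleanStartSketch {

    // The two outcomes described in the report.
    enum Action {
        ASSIGN_ALL_USER_REGIONS, // bulk assignment of user regions at startup
        LEAVE_TO_FAILOVER_PATH   // regions are left to failover handling and the balancer
    }

    // failover:   whether the master decided it is in a failover situation
    // cleanStart: whether the (proposed, not existing) --clean flag was passed
    static Action decide(boolean failover, boolean cleanStart) {
        // Proposed change: the clean-start flag overrides the failover state.
        if (cleanStart || !failover) {
            return Action.ASSIGN_ALL_USER_REGIONS;
        }
        // Behavior as reported: with failover set, bulk assignment is skipped.
        return Action.LEAVE_TO_FAILOVER_PATH;
    }

    public static void main(String[] args) {
        boolean[] flags = {false, true};
        // Print the outcome for every combination of the two inputs.
        for (boolean failover : flags) {
            for (boolean cleanStart : flags) {
                System.out.println("failover=" + failover
                        + " cleanStart=" + cleanStart
                        + " -> " + decide(failover, cleanStart));
            }
        }
    }
}
```

The only case that changes under the proposal is failover=true with the flag
set: bulk assignment would run there instead of being skipped.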
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)