churro morales created HBASE-14129:
--------------------------------------

             Summary: If any regionserver gets shutdown uncleanly during full 
cluster restart, locality looks to be lost
                 Key: HBASE-14129
                 URL: https://issues.apache.org/jira/browse/HBASE-14129
             Project: HBase
          Issue Type: Bug
            Reporter: churro morales


We were doing a full cluster restart the other day, and some regionservers did 
not shut down cleanly.  After the restart our locality dropped from 99% to 5%.  
Looking at the code, AssignmentManager.joinCluster() calls 
AssignmentManager.processDeadServersAndRegionsInTransition().
If the failover flag gets set for any reason, it seems we don't call 
assignAllUserRegions().  It then looks like the balancer does the work of 
assigning those regions.  We don't use a locality-aware balancer, so we lost 
our region locality.

I don't have a solid grasp of the reasoning behind these checks, but there are 
some potential workarounds:

1. After shutting down your cluster, move your WALs aside (replay them later).
2. Clean up your znodes.

That seems to work, but it requires a lot of manual labor.  Another solution, 
which I prefer, would be to add a flag, e.g. ./start-hbase.sh --clean.

If the master is started with that flag, we check it in 
AssignmentManager.processDeadServersAndRegionsInTransition(): if the flag is 
set, we call assignAllUserRegions() regardless of the failover state.
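
To make the proposal concrete, here is a minimal sketch of the decision as I 
understand it.  This is not the actual AssignmentManager code: the class, the 
enum, and the cleanStart parameter are hypothetical names used only for 
illustration, and the real method does considerably more work.

```java
// Simplified, hypothetical model of the startup decision described above.
// NOT the HBase AssignmentManager source; all names here are illustrative.
class StartupAssignmentSketch {

    /** What happens to user regions at master startup in this model. */
    enum Action {
        // Bulk assignment, the path taken today when no failover is detected.
        ASSIGN_ALL_USER_REGIONS,
        // Bulk assignment is skipped; regions are placed later (e.g. by the balancer).
        LEAVE_TO_FAILOVER_HANDLING
    }

    /**
     * @param failover   true if the master concluded it is in a failover
     *                   (for example, because a regionserver died uncleanly)
     * @param cleanStart hypothetical flag standing in for the proposed --clean option
     */
    static Action decide(boolean failover, boolean cleanStart) {
        // Proposed change: the flag overrides the failover state.
        if (cleanStart || !failover) {
            return Action.ASSIGN_ALL_USER_REGIONS;
        }
        return Action.LEAVE_TO_FAILOVER_HANDLING;
    }

    public static void main(String[] args) {
        System.out.println("failover, no flag:   " + decide(true, false));
        System.out.println("failover, --clean:   " + decide(true, true));
        System.out.println("no failover, no flag: " + decide(false, false));
    }
}
```

With cleanStart left false the model behaves like the current code path 
described above, so the flag would only change behavior when an operator 
explicitly asks for it.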

I have a patch for the latter solution, assuming I am understanding the logic 
correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
