[
https://issues.apache.org/jira/browse/HBASE-18036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007393#comment-16007393
]
Stephen Yuan Jiang commented on HBASE-18036:
--------------------------------------------
The attached V0 patch is my first attempt to resolve this issue. The change is
in SSH. If the dead region server has already restarted by the time SSH runs
(same hostname and port, but a different start code in ServerName), SSH will
try to retain locality by assigning the regions back to that region server. I
introduced a config option for anyone who wants to keep the round-robin
assignment behavior.
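To make the matching rule concrete, here is a minimal standalone sketch. It is
not the code in the patch: the Server class is a hypothetical stand-in for
ServerName, and the retainLocality flag and roundRobinIndex parameter are
invented for illustration (the real config name is in the attached patch).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class RetainLocalitySketch {

    /** Hypothetical stand-in for ServerName: hostname, port, and start code. */
    static final class Server {
        final String hostname;
        final int port;
        final long startCode;

        Server(String hostname, int port, long startCode) {
            this.hostname = hostname;
            this.port = port;
            this.startCode = startCode;
        }

        boolean sameHostAndPort(Server other) {
            return hostname.equals(other.hostname) && port == other.port;
        }

        @Override
        public String toString() {
            return hostname + "," + port + "," + startCode;
        }
    }

    /** Finds an online server that is a restarted instance of the dead one. */
    static Optional<Server> findRestartedInstance(Server dead, List<Server> online) {
        return online.stream()
                // Same hostname and port, but a different start code.
                .filter(s -> s.sameHostAndPort(dead) && s.startCode != dead.startCode)
                .findFirst();
    }

    /**
     * Picks the target server for a region that lived on the dead server.
     * retainLocality is an assumed flag standing in for the new config;
     * roundRobinIndex is an assumed simplification of round-robin assignment.
     */
    static Server chooseTarget(Server dead, List<Server> online,
                               boolean retainLocality, int roundRobinIndex) {
        if (online.isEmpty()) {
            throw new IllegalArgumentException("no online servers");
        }
        if (retainLocality) {
            Optional<Server> restarted = findRestartedInstance(dead, online);
            if (restarted.isPresent()) {
                return restarted.get();
            }
        }
        // Fall back to the old behavior when disabled or no restarted instance.
        return online.get(Math.floorMod(roundRobinIndex, online.size()));
    }

    public static void main(String[] args) {
        Server dead = new Server("rs1.example.com", 16020, 1000L);
        List<Server> online = Arrays.asList(
                new Server("rs2.example.com", 16020, 1500L),
                new Server("rs1.example.com", 16020, 2000L));
        System.out.println(chooseTarget(dead, online, true, 0));
        System.out.println(chooseTarget(dead, online, false, 0));
    }
}
```

With the flag on, the region goes back to the restarted rs1 instance; with it
off, the choice ignores where the region used to live.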
I forced the existing TestAssignmentManagerOnCluster tests to use the new code
path in SSH and did not see any problems. What is still missing is a new UT in
TestAssignmentManagerOnCluster that tests the retain-assignment code path in
SSH.
For now, I'd like to post this V0 patch to get some feedback.
> Data locality is not maintained after cluster restart or SSH
> ------------------------------------------------------------
>
> Key: HBASE-18036
> URL: https://issues.apache.org/jira/browse/HBASE-18036
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.4.0, 1.3.1, 1.2.5, 1.1.10
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Attachments: HBASE-18036.v0-branch-1.1.patch
>
>
> After HBASE-2896 / HBASE-4402, we think data locality is maintained after a
> cluster restart. However, we have seen some complaints about data locality
> loss when the cluster restarts (e.g. HBASE-17963).
> Examining the AssignmentManager#processDeadServersAndRegionsInTransition()
> code, for cluster start, I expected to hit the following code path:
> {code}
> if (!failover) {
>   // Fresh cluster startup.
>   LOG.info("Clean cluster startup. Assigning user regions");
>   assignAllUserRegions(allRegions);
> }
> {code}
> where assignAllUserRegions() would call retainAssignment() in LoadBalancer;
> however, judging from the master log, we usually hit the failover code path:
> {code}
> // If we found user regions out on cluster, its a failover.
> if (failover) {
>   LOG.info("Found regions out on cluster or in RIT; presuming failover");
>   // Process list of dead servers and regions in RIT.
>   // See HBASE-4580 for more information.
>   processDeadServersAndRecoverLostRegions(deadServers);
> }
> {code}
> where processDeadServersAndRecoverLostRegions() would put dead servers in SSH
> and SSH uses roundRobinAssignment() in LoadBalancer. That is why we lose
> locality more often than we retain it during a cluster restart.
> Note: the code I was looking at is close to branch-1 and branch-1.1.