[ https://issues.apache.org/jira/browse/HBASE-14215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729961#comment-14729961 ]
Enis Soztutar commented on HBASE-14215:
---------------------------------------
bq. Given that potential candidates are generated randomly, one would assume
that "global optimum" will be attained with multiple candidate generations and
there will be no "local optimum". No?
SLB works somewhat like gradient descent
(https://en.wikipedia.org/wiki/Gradient_descent), except that we do not
generate all the neighboring "candidates" for the next assignment plan. We
randomly generate a new candidate plan, and we always accept the candidate if
it reduces the cost and reject it otherwise. This greedy search is thus
vulnerable to local minima.
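A toy sketch of that greedy acceptance rule, to show how it can get stuck. This is not the SLB code: the class name, the one-dimensional "plan" index, and the cost array are all made up for illustration, and a "neighbor" here is just the index one step left or right.

```java
import java.util.Random;

public class GreedySearchSketch {
    // Hypothetical cost landscape: index = candidate plan, value = its cost.
    // The global minimum is plan 5 (cost 1); plans 1 and 3 are local minima.
    static final double[] COST = {5, 3, 4, 2, 6, 1, 7};

    /** Random neighbor search that accepts a candidate only if it lowers the cost. */
    static int search(int start, int steps, long seed) {
        Random rnd = new Random(seed);
        int current = start;
        for (int i = 0; i < steps; i++) {
            // Pick one random neighbor instead of enumerating all of them.
            int candidate = current + (rnd.nextBoolean() ? 1 : -1);
            if (candidate < 0 || candidate >= COST.length) {
                continue; // outside the toy landscape, try again
            }
            if (COST[candidate] < COST[current]) {
                current = candidate; // greedy: keep only improvements
            }
        }
        return current;
    }

    public static void main(String[] args) {
        int end = search(0, 1000, 42L);
        System.out.println("stopped at plan " + end + " with cost " + COST[end]);
    }
}
```

Starting from plan 0 the search moves to plan 1 and stays there, because both of its neighbors cost more, even though plan 5 is cheaper overall. More iterations do not help; only accepting some worse candidates or using a different candidate generator would.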
bq. As we included a new cost function for primary replication skew, will
taking into account of primary replicas in the candidate generator (may be in
RegionReplicaCandidateGenerator) can help keep
hbase.master.balancer.stochastic.primaryRegionCountCost lower?
It might. RRCG has a code section that prefers to move out a secondary region
replica rather than a primary. Maybe that is increasing the primary region
count skew. Do you want to try cutting it out, or changing the candidate
generator? I can take this on if you want (there is a way to simulate a
cluster and assignment plans in a unit test env so that we can iterate
quickly).
{code}
// we have found the primary id for the region to move. Now find the actual regionIndex
// with the given primary, prefer to move the secondary region.
for (int j = 0; j < regionsPerGroup.length; j++) {
  int regionIndex = regionsPerGroup[j];
  if (selectedPrimaryIndex == regionIndexToPrimaryIndex[regionIndex]) {
    // always move the secondary, not the primary
    if (selectedPrimaryIndex != regionIndex) {
      return regionIndex;
    }
  }
}
{code}
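To make the effect of that loop concrete, here is a standalone sketch that wraps it in a method and runs it on made-up data. The class and method names, the sample arrays, and the -1 return for "nothing to move" are assumptions of this sketch, not taken from the RRCG source.

```java
public class ReplicaPickSketch {
    /**
     * Same selection rule as the quoted loop: among the regions in the group,
     * return one that shares the selected primary but is not the primary itself.
     * Returning -1 when no such region exists is an assumption of this sketch.
     */
    static int pickSecondary(int[] regionsPerGroup, int[] regionIndexToPrimaryIndex,
                             int selectedPrimaryIndex) {
        for (int regionIndex : regionsPerGroup) {
            boolean samePrimary = selectedPrimaryIndex == regionIndexToPrimaryIndex[regionIndex];
            boolean isSecondary = selectedPrimaryIndex != regionIndex;
            if (samePrimary && isSecondary) {
                return regionIndex;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical layout: regions 0 and 2 are primaries,
        // region 1 is a replica of 0 and region 3 is a replica of 2.
        int[] regionIndexToPrimaryIndex = {0, 0, 2, 2};
        // Hypothetical group (e.g. one host) holding regions 0, 1 and 2.
        int[] regionsPerGroup = {0, 1, 2};

        System.out.println(pickSecondary(regionsPerGroup, regionIndexToPrimaryIndex, 0)); // prints 1
        System.out.println(pickSecondary(regionsPerGroup, regionIndexToPrimaryIndex, 2)); // prints -1
    }
}
```

With this rule the primary is never the region chosen to move, so a generator built on it cannot by itself even out the primary counts across servers.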
> Default cost used for PrimaryRegionCountSkewCostFunction is not sufficient
> ---------------------------------------------------------------------------
>
> Key: HBASE-14215
> URL: https://issues.apache.org/jira/browse/HBASE-14215
> Project: HBase
> Issue Type: Bug
> Components: Balancer
> Reporter: Biju Nair
> Priority: Minor
> Attachments: 14215-v1.txt
>
>
> The current multiplier of 500 used in the stochastic balancer cost function
> {{PrimaryRegionCountSkewCostFunction}} to calculate the cost of total
> primary replication skew doesn't seem to be sufficient to prevent the skews
> (refer to HBASE-14110). We would want the default cost to be a higher value
> so that skew in primary region replicas has a higher cost. The following are
> the test results from setting the multiplier value to 10000 (same as the
> region replica rack cost multiplier) on a 3-rack, 9-RS-node cluster, which
> seems to get the balancer to distribute the primaries uniformly.
> *Initial Primary replica distribution - using the current multiplier*
> |r1n10| 102|
> |r1n11| 85|
> |r1n9| 88|
> |r2n10| 120|
> |r2n11| 120|
> |r2n9| 124|
> |r3n10| 135|
> |r3n11| 124|
> |r3n9| 129|
> *After long duration of read & writes - using current multiplier*
> | r1n10| 102|
> | r1n11| 85|
> | r1n9| 88|
> | r2n10| 120|
> | r2n11| 120|
> | r2n9 | 124|
> | r3n10| 135|
> | r3n11| 124|
> | r3n9| 129|
> *After manual balancing*
> | r1n10| 102|
> | r1n11| 85|
> | r1n9| 88|
> | r2n10| 120|
> | r2n11| 120|
> | r2n9 | 124|
> | r3n10| 135|
> | r3n11| 124|
> | r3n9| 129|
> *Increased multiplier for primaryRegionCountSkewCost to 10000*
> | r1n10| 114|
> | r1n11 | 113|
> | r1n9 | 114|
> | r2n10| 114|
> | r2n11| 114|
> | r2n9 | 113|
> | r3n10| 115|
> | r3n11| 115|
> | r3n9 | 115 |
> Setting the {{PrimaryRegionCountSkewCostFunction}} multiplier value to 10000
> should help general HBase use.
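A quick arithmetic check on the counts in the tables above, as a sketch. The class and method names are made up, and "spread" (max minus min) is only a simple stand-in for skew, not the balancer's cost formula.

```java
import java.util.Arrays;

public class PrimarySkewSketch {
    /** Difference between the most and least loaded server. */
    static int spread(int[] counts) {
        return Arrays.stream(counts).max().getAsInt() - Arrays.stream(counts).min().getAsInt();
    }

    /** Total number of primaries across all servers. */
    static int total(int[] counts) {
        return Arrays.stream(counts).sum();
    }

    public static void main(String[] args) {
        // Primary counts per server, copied from the issue description.
        int[] multiplier500 = {102, 85, 88, 120, 120, 124, 135, 124, 129};
        int[] multiplier10000 = {114, 113, 114, 114, 114, 113, 115, 115, 115};

        System.out.println("total:  " + total(multiplier500) + " vs " + total(multiplier10000));
        System.out.println("spread: " + spread(multiplier500) + " vs " + spread(multiplier10000));
    }
}
```

Both distributions hold the same 1027 primaries, but the gap between the most and least loaded server drops from 50 to 2 with the higher multiplier.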
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)