[
https://issues.apache.org/jira/browse/HBASE-10017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830224#comment-13830224
]
Roman Nikitchenko commented on HBASE-10017:
-------------------------------------------
Yet another note about the proposed fix. An even worse problem is the unequal
distribution of rows among the partitions. With something like 30 reducers or
more, the hash-based assignment becomes noticeably skewed. Here is illustrative
code:
public class Main {
  public static void main(String[] args) {
    int numPartitions = 32;
    int numRegions = 1000000;
    int[] parts = new int[numPartitions];
    for (int i = 0; i < numRegions; ++i) {
      // Hash-based assignment applied to region index i: hash the decimal
      // string of the index, as the partitioner does for overflow regions.
      int part = (Integer.toString(i).hashCode() & Integer.MAX_VALUE) % numPartitions;
      parts[part]++;
    }
    for (int i = 0; i < numPartitions; ++i) {
      System.out.println(parts[i]);
    }
  }
}
When run, it produces a histogram with up to a 5x difference in load between
reducers, which is completely unacceptable.
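For comparison, here is a minimal sketch (purely illustrative, not taken from
the attached patch) that puts the hash-based assignment side by side with a
plain round-robin i % numPartitions assignment for the same region indexes;
the class name EvenSpread and the round-robin choice are assumptions for
illustration only:
public class EvenSpread {
  public static void main(String[] args) {
    int numPartitions = 32;
    int numRegions = 1000000;
    int[] hashed = new int[numPartitions];
    int[] roundRobin = new int[numPartitions];
    for (int i = 0; i < numRegions; ++i) {
      // current behaviour: hash the decimal string of the region index
      hashed[(Integer.toString(i).hashCode() & Integer.MAX_VALUE) % numPartitions]++;
      // illustrative alternative: spread region indexes round-robin
      roundRobin[i % numPartitions]++;
    }
    for (int i = 0; i < numPartitions; ++i) {
      System.out.println(hashed[i] + "\t" + roundRobin[i]);
    }
  }
}
The second column is perfectly flat (1000000 / 32 = 31250 per partition),
while the first shows the skew described above.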
> HRegionPartitioner, rows directed to last partition are wrongly mapped.
> -----------------------------------------------------------------------
>
> Key: HBASE-10017
> URL: https://issues.apache.org/jira/browse/HBASE-10017
> Project: HBase
> Issue Type: Bug
> Components: mapreduce
> Affects Versions: 0.94.6
> Reporter: Roman Nikitchenko
> Attachments: HBASE-10017-r1544633.patch
>
>
> Inside the HRegionPartitioner class there is a getPartition() method which
> should map the first numPartitions regions to partitions 1:1. But because of
> the condition used, the last of these regions is hashed instead, which can
> leave the last reducer without any data. This is considered a serious issue.
> I reproduced this only starting from 16 regions per table. The original defect
> was found in 0.94.6, but at least today's trunk and the 0.91 branch head have
> the same HRegionPartitioner code in this part, which means the same issue.
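For reference, a simplified, self-contained paraphrase of the getPartition()
behaviour described in the quoted report (not the exact HBase source; the
method partitionForRegionIndex() is an illustrative stand-in that only models
how a matched region index i is turned into a partition number, assuming the
check is of the form i >= numPartitions - 1, which is consistent with the
behaviour described above):
public class PartitionConditionSketch {
  // Regions 0 .. numPartitions-2 map 1:1, but because the check is
  // "i >= numPartitions - 1", region numPartitions-1 falls into the hashing
  // branch as well, so the last reducer may receive no data.
  static int partitionForRegionIndex(int i, int numPartitions) {
    if (i >= numPartitions - 1) {
      return (Integer.toString(i).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    return i;
  }

  public static void main(String[] args) {
    int numPartitions = 16;
    for (int i = 0; i < numPartitions; ++i) {
      System.out.println("region " + i + " -> partition "
          + partitionForRegionIndex(i, numPartitions));
    }
  }
}
With 16 regions and 16 partitions, region 15 is hashed onto another partition
and partition 15 stays empty, which matches the report that the problem shows
up starting from 16 regions per table.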
--
This message was sent by Atlassian JIRA
(v6.1#6144)