[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Oliver Hummel updated MAPREDUCE-7085:
-------------------------------------
    Description: 
After getting the "Split points are out of order" exception from
TotalOrderPartitioner, I dug into the source of the
org.apache.hadoop.mapreduce.lib.partition.InputSampler class and found that the
while loop at line 335 is never entered.

The reason is that the variable last is always smaller than k, while the loop
condition requires last to be greater than or equal to k.

I am not completely sure of the original purpose of this loop. If it is what I
assume, namely reducing the occurrences of identical split points, then I would
change it like so:

  while (last != -1 && k > last
      && comparator.compare(samples[last], samples[k]) == 0) {
    --k;
  }

However, this only slightly mitigates the problem, since a highly skewed
distribution of keys might still lead to identical split points, so further
measures might be necessary.
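
For illustration, here is a minimal sketch of what such a further measure
could look like. It is neither the InputSampler code nor the change proposed
above: it moves k forward (++k) past samples equal to the previous split
point and returns fewer split points when the keys are too skewed to supply
enough distinct ones. SplitPointSketch and pickSplitPoints are made-up names,
and the evenly spaced step size is an assumption about how candidates are
chosen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Hadoop: picks up to numPartitions - 1
// strictly increasing split points from an already sorted sample array.
final class SplitPointSketch {

    static <K> List<K> pickSplitPoints(K[] sorted, Comparator<? super K> cmp,
                                       int numPartitions) {
        List<K> splits = new ArrayList<>();
        // Assumption: candidates are spaced evenly over the sorted samples.
        float stepSize = sorted.length / (float) numPartitions;
        int last = -1; // index of the previously chosen sample, -1 if none
        for (int i = 1; i < numPartitions; ++i) {
            // Evenly spaced candidate, but never at or before the previous choice.
            int k = Math.max(Math.round(stepSize * i), last + 1);
            // Skip forward past samples equal to the previous split point.
            while (last != -1 && k < sorted.length
                    && cmp.compare(sorted[last], sorted[k]) == 0) {
                ++k;
            }
            if (k >= sorted.length) {
                break; // not enough distinct keys left: return fewer split points
            }
            splits.add(sorted[k]);
            last = k;
        }
        return splits;
    }

    public static void main(String[] args) {
        Integer[] skewed = {1, 1, 1, 1, 1, 1, 2, 3};
        Arrays.sort(skewed);
        // prints [1, 2, 3]
        System.out.println(
            pickSplitPoints(skewed, Comparator.<Integer>naturalOrder(), 4));
    }
}
```

With a fully degenerate input such as {5, 5, 5, 5} and three partitions, the
sketch returns a single split point instead of two identical ones, so the
caller would have to cope with fewer partitions than requested.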

  was:
After getting the "Split points are out of order" exception from
TotalOrderPartitioner, I dug into the source of the
org.apache.hadoop.mapreduce.lib.partition.InputSampler class and found that the
while loop at line 335 is never entered.

The reason is that the variable last is always smaller than k, while the loop
condition requires last to be greater than or equal to k.

I am not completely sure of the original purpose of this loop. If it is what I
assume, namely reducing the occurrences of identical split points, then I would
change it like so:

            while (last != -1 && k > last && comparator.compare(samples[last], 
samples[k]) == 0)

{                 --k;             }

However, this only slightly mitigates the problem, since a highly skewed
distribution of keys might still lead to identical split points, so further
measures might be necessary.


> while loop in InputSampler.writePartitionFile method does not make sense
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7085
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7085
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.8.3
>            Reporter: Oliver Hummel
>            Priority: Minor


