No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------
Key: MAPREDUCE-1987
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 0.20.2
Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
Reporter: Fabrice Huet
If I understand correctly, the partition file should contain distinct values
in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the
number of reduce tasks, the index k can take the same value more than once. As
a side effect of the while loop, the written values end up interleaved.
Example: taking 100 samples on a job with 120 reducers produces the following
values of k and last after the while loop. For context, the relevant loop body
in writePartitionFile is shown below (k starts at Math.round(stepSize * i) for
partition i, and last is the index written on the previous iteration):
int k = Math.round(stepSize * i); // stepSize = samples.length / (float) numPartitions, here below 1
while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
  ++k;
}
// display values of k and last here
writer.append(samples[k], nullValue);
last = k;
k = 68, last = 67  // correct
k = 69, last = 68  // correct
k = 68, last = 69  // incorrect: samples[68] has already been written
k = 69, last = 68  // incorrect: samples[69] has already been written
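For what it's worth, the interleaving can be reproduced outside Hadoop with the
same arithmetic. The following is a minimal standalone sketch (my own demo, not
Hadoop code; it uses int keys and == in place of the job's comparator), with a
sorted sample of 100 keys containing duplicates and 120 partitions, so that
stepSize is below 1:

public class SplitPointDemo {
  public static void main(String[] args) {
    int numPartitions = 120;              // more reducers than samples
    int[] samples = new int[100];
    for (int i = 0; i < samples.length; ++i) {
      samples[i] = i / 2;                 // adjacent duplicates, already sorted
    }
    float stepSize = samples.length / (float) numPartitions;  // ~0.83
    int last = -1;
    for (int i = 1; i < numPartitions; ++i) {
      int k = Math.round(stepSize * i);
      // same duplicate-skipping loop as writePartitionFile
      while (last >= k && samples[last] == samples[k]) {
        ++k;
      }
      if (k <= last) {
        // a split point at index <= last has already been written
        System.out.println("out of order: k=" + k + ", last=" + last);
      }
      last = k;
    }
  }
}

Running it prints out-of-order pairs analogous to the trace above.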
When the TotalOrderPartitioner later reads the partition file back, it rejects
it as corrupt:
throw new IOException("Split points are out of order");
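The exception comes from the strict-ordering validation run when the split
points are loaded; roughly (paraphrased from memory of the 0.20 source, not a
verbatim quote):

// Paraphrase of the validation in TotalOrderPartitioner (0.20-era, from
// memory): each split point must be strictly greater than the previous
// one under the job's comparator.
for (int i = 0; i < splitPoints.length - 1; ++i) {
  if (comparator.compare(splitPoints[i], splitPoints[i + 1]) >= 0) {
    throw new IOException("Split points are out of order");
  }
}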
It seems to me that the number of partitions should be min(samples.length,
job.getNumReduceTasks(), number of distinct values in the sample).
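A minimal sketch of that suggestion (a hypothetical patch, not a committed
fix; it assumes samples is already sorted with the job's comparator, as
writePartitionFile does):

// Hypothetical sketch of the suggested guard, not a committed patch.
// Count the distinct keys in the sorted sample...
int distinct = samples.length > 0 ? 1 : 0;
for (int i = 1; i < samples.length; ++i) {
  if (comparator.compare(samples[i - 1], samples[i]) != 0) {
    ++distinct;
  }
}
// ...and cap the number of partitions before computing stepSize, so
// there are always enough distinct values for the split points.
int numPartitions = Math.min(samples.length,
    Math.min(job.getNumReduceTasks(), distinct));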