No verification on sample size can lead to incorrect partition file and "Split points are out of order" IOException -------------------------------------------------------------------------------------------------------------------
                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet

If I understand correctly, the partition file should contain distinct values in increasing order. In InputSampler.writePartitionFile(...), if the sample size is lower than the number of reducers, the k index might keep the same value. As a side effect of the while loop, values will be interleaved.

Example: taking 100 samples on a 120-reducer job will produce the following values of k and last after the while loop

    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    //display values here

    k 68 last 67 //correct
    k 69 last 68 //correct
    k 68 last 69 //incorrect, samples[68] has already been written
    k 69 last 68 //incorrect, samples[69] has already been written

The partition file will be considered corrupted when it is read back by the TotalOrderPartitioner:

    throw new IOException("Split points are out of order");

It seems to me that the number of partitions should be min(samples.length, job.getNumReduceTasks(), number of distinct values in the sample).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
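The suggested cap can be sketched in isolation. This is a minimal illustration of the reporter's proposal, not Hadoop's actual InputSampler code: the method name choosePartitionPoints and the use of plain int keys are assumptions for the example; the real writePartitionFile works on generic keys with a RawComparator. The idea is that limiting the partition count to min(samples.length, numReduceTasks, distinct values in the sample) guarantees the emitted split points are strictly increasing, so TotalOrderPartitioner's "Split points are out of order" check cannot fire:

```java
import java.util.Arrays;
import java.util.TreeSet;

public class SplitPoints {

    // Sketch of the proposed fix: cap the number of partitions at
    // min(samples.length, numReduceTasks, distinct sample values) and
    // pick split points from the distinct values only, so the result
    // is strictly increasing. Illustrative, not Hadoop's real API.
    static int[] choosePartitionPoints(int[] samples, int numReduceTasks) {
        int[] sorted = samples.clone();
        Arrays.sort(sorted);

        // Distinct sample values, kept in increasing order.
        TreeSet<Integer> distinctSet = new TreeSet<>();
        for (int s : sorted) {
            distinctSet.add(s);
        }
        Integer[] distinct = distinctSet.toArray(new Integer[0]);

        // The reporter's proposed cap on the partition count.
        int numPartitions = Math.min(Math.min(sorted.length, numReduceTasks),
                                     distinct.length);

        // numPartitions - 1 evenly spaced split points drawn from the
        // distinct values; since distinct.length >= numPartitions, the
        // chosen indices (and hence the values) strictly increase.
        int[] splits = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; ++i) {
            splits[i - 1] = distinct[i * distinct.length / numPartitions];
        }
        return splits;
    }

    public static void main(String[] args) {
        // The scenario from the report: 100 samples, 120 reducers.
        int[] samples = new int[100];
        for (int i = 0; i < 100; ++i) samples[i] = i;
        int[] splits = choosePartitionPoints(samples, 120);
        // Only 100 partitions are possible, hence 99 split points.
        System.out.println(splits.length);
        for (int i = 1; i < splits.length; ++i) {
            if (splits[i] <= splits[i - 1]) {
                throw new AssertionError("Split points are out of order");
            }
        }
        System.out.println("splits strictly increasing");
    }
}
```

With duplicated sample values the cap tightens further: six samples over three distinct keys yield at most three partitions, regardless of how many reducers the job requests.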