No verification on sample size can lead to incorrect partition file and "Split 
points are out of order" IOException
-------------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1987
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1987
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.20.2
         Environment: 10 Linux machines with Hadoop 0.20.2 and JDK1.7.0
            Reporter: Fabrice Huet


If I understand correctly, the partition file should contain distinct values 
in increasing order.
In InputSampler.writePartitionFile(...), if the sample size is lower than the 
number of reducers, the k index might keep the same value from one iteration 
to the next. As a side effect of the while loop, values will be interleaved.

Example: taking 100 samples on a 120-reducer job will produce the following 
values of k and last after the while loop:
    while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
      ++k;
    }
    // display values here

    k 68
    last 67    // correct

    k 69
    last 68    // correct

    k 68
    last 69    // incorrect, samples[68] has already been written

    k 69
    last 68    // incorrect, samples[69] has already been written

The partition file is then treated as corrupted when it is read by the 
TotalOrderPartitioner:
   throw new IOException("Split points are out of order");
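
The behaviour can be reproduced outside Hadoop with a minimal sketch. Only the 
while loop is quoted in this report; the surrounding for loop, stepSize and 
Math.round are assumptions about what writePartitionFile does in 0.20.2. Plain 
int values stand in for the sampled keys, and the class and method names are 
hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPointDemo {

    // Mimics the index selection in writePartitionFile. The outer loop,
    // stepSize and Math.round are assumptions; only the while loop is
    // quoted in the report.
    static List<Integer> chooseIndices(int[] samples, int numPartitions) {
        List<Integer> chosen = new ArrayList<>();
        float stepSize = samples.length / (float) numPartitions;
        int last = -1;
        for (int i = 1; i < numPartitions; ++i) {
            int k = Math.round(stepSize * i);
            // Skip forward only while the candidate equals the last key written.
            while (last >= k && samples[last] == samples[k]) {
                ++k;
            }
            chosen.add(k);   // the real code would write samples[k] here
            last = k;
        }
        return chosen;
    }

    // True if every element is greater than the one before it.
    static boolean isStrictlyIncreasing(List<Integer> values) {
        for (int i = 1; i < values.size(); ++i) {
            if (values.get(i) <= values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 100 distinct, sorted samples and 120 reducers, as in the example.
        int[] samples = new int[100];
        for (int i = 0; i < samples.length; ++i) {
            samples[i] = i;
        }
        List<Integer> chosen = chooseIndices(samples, 120);
        System.out.println("indices: " + chosen);
        System.out.println("strictly increasing: " + isStrictlyIncreasing(chosen));
    }
}
```

With these inputs stepSize is below 1, so consecutive iterations can round to 
the same k. The while loop then pushes k one past last, and the next iteration 
falls back behind it, which is where the written keys go out of order.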

It seems to me that the number of partitions should be min(samples.length, 
job.getNumReduceTasks(), number of distinct values in the sample).
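
One possible shape for such a check is sketched below. It is only an 
illustration of the min(...) rule suggested above, not the actual Hadoop code 
or an agreed fix. The class and method names are hypothetical, int values stand 
in for keys, and the samples are assumed to be already sorted.

```java
import java.util.ArrayList;
import java.util.List;

public class SafeSplitPoints {

    // Returns strictly increasing split points, using at most
    // min(numReduceTasks, number of distinct samples) partitions.
    // The distinct count never exceeds samples.length, so that term
    // of the min(...) is covered implicitly.
    static List<Integer> chooseSplitPoints(int[] sortedSamples, int numReduceTasks) {
        // Collapse runs of equal keys; the input is assumed to be sorted.
        List<Integer> distinct = new ArrayList<>();
        for (int s : sortedSamples) {
            if (distinct.isEmpty() || distinct.get(distinct.size() - 1) != s) {
                distinct.add(s);
            }
        }
        int numPartitions = Math.min(numReduceTasks, distinct.size());
        List<Integer> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; ++i) {
            // Integer arithmetic: because distinct.size() >= numPartitions,
            // k grows by at least 1 per iteration and stays below distinct.size().
            int k = (int) ((long) i * distinct.size() / numPartitions);
            splits.add(distinct.get(k));
        }
        return splits;
    }

    public static void main(String[] args) {
        int[] samples = new int[100];
        for (int i = 0; i < samples.length; ++i) {
            samples[i] = i;
        }
        // 120 reducers requested, but only 100 distinct samples are available.
        System.out.println(chooseSplitPoints(samples, 120));
    }
}
```

A real change would also have to keep the job's reducer count consistent with 
the number of split points actually written; that part is not shown here.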



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
