Ted Malaska created MAPREDUCE-6367:
--------------------------------------

             Summary: UniformSizeInputFormat skews left over bytes to last split
                 Key: MAPREDUCE-6367
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6367
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 2.5.2, 2.6.0
            Reporter: Ted Malaska
            Assignee: Ted Malaska
            Priority: Minor


UniformSizeInputFormat tries to assign an equal number of bytes to every 
split. With the current logic, however, every split ends up with slightly 
less than the ideal amount, and the bytes left over from each split are 
pushed into the last split.

The result is a large skew for the last split.
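
To make the skew concrete, here is a rough, simplified model of the greedy 
packing described above. It is a sketch, not the DistCp source: the class 
SplitSkewSketch, the method splitSizes and the sample file sizes are all 
hypothetical, and it assumes that once numSplits - 1 splits have been closed 
the final split absorbs every remaining file, which is how the behavior is 
described in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the greedy splitting described in this issue.
// Hypothetical illustration only; this is not the DistCp implementation.
class SplitSkewSketch {

    // Returns the number of bytes assigned to each split.
    // Assumption: after numSplits - 1 splits are closed, the last split
    // takes every remaining file.
    static List<Long> splitSizes(long[] fileSizes, int numSplits) {
        long total = 0;
        for (long len : fileSizes) {
            total += len;
        }
        // Ideal number of bytes per split.
        long nBytesPerSplit = (long) Math.ceil(total * 1.0 / numSplits);

        List<Long> sizes = new ArrayList<>();
        long currentSplitSize = 0;
        for (long len : fileSizes) {
            // Close the current split before it would exceed the target,
            // so each closed split falls short of nBytesPerSplit.
            boolean wouldOverflow = currentSplitSize + len > nBytesPerSplit;
            boolean canCloseSplit =
                currentSplitSize > 0 && sizes.size() < numSplits - 1;
            if (wouldOverflow && canCloseSplit) {
                sizes.add(currentSplitSize);
                currentSplitSize = 0;
            }
            currentSplitSize += len;
        }
        // Whatever is still accumulated becomes the last split.
        if (currentSplitSize > 0) {
            sizes.add(currentSplitSize);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Ten files of 6 bytes each, four splits: target is 15 bytes.
        long[] fileSizes = new long[10];
        Arrays.fill(fileSizes, 6L);
        System.out.println(splitSizes(fileSizes, 4)); // prints [12, 12, 12, 24]
    }
}
```

In this model each of the first three splits stops 3 bytes short of the 
15-byte target, and the last split carries 24 bytes, twice as much as any 
other.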

Below is the affected area of the code:

https://github.com/apache/hadoop/blob/9ae7f9eb7baeb244e1b95aabc93ad8124870b9a9/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/UniformSizeInputFormat.java#L98

The fix would be to change the following line:

currentSplitSize += srcFileStatus.getLen();

to

currentSplitSize += srcFileStatus.getLen() + (currentSplitSize - nBytesPerSplit);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
