Ted Malaska created MAPREDUCE-6367:
--------------------------------------
Summary: UniformSizeInputFormat skews left over bytes to last split
Key: MAPREDUCE-6367
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6367
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.5.2, 2.6.0
Reporter: Ted Malaska
Assignee: Ted Malaska
Priority: Minor
UniformSizeInputFormat tries to assign an equal number of bytes to every
split. However, the current logic gives every split slightly less than the
ideal amount, and the leftover bytes from every split are put into the
last split.
This results in a large skew for the last split.
Below is the area of the code that is affected:
https://github.com/apache/hadoop/blob/9ae7f9eb7baeb244e1b95aabc93ad8124870b9a9/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/UniformSizeInputFormat.java#L98
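To make the shortfall concrete, here is a minimal sketch of a greedy packing
loop of the kind described above. It is a simplified model, not the actual
DistCp code: the class name SplitSkewSketch and the helper splitSizes are
hypothetical, the input is a plain array of file lengths rather than a
sequence-file listing, and a split is assumed to be closed whenever adding the
next file would push it past the per-split target. Only currentSplitSize and
nBytesPerSplit are names taken from the linked source.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSkewSketch {

    // Hypothetical helper: packs file lengths into splits greedily and
    // returns the number of bytes that ended up in each split.
    static List<Long> splitSizes(long[] fileLengths, int numSplits) {
        long totalBytes = 0;
        for (long len : fileLengths) {
            totalBytes += len;
        }
        // Target size per split (assumption: rounded up).
        long nBytesPerSplit = (long) Math.ceil(totalBytes * 1.0 / numSplits);

        List<Long> sizes = new ArrayList<>();
        long currentSplitSize = 0;
        for (long len : fileLengths) {
            // Close the current split BEFORE it would exceed the target,
            // so every closed split ends up at or below the target.
            if (currentSplitSize + len > nBytesPerSplit && currentSplitSize != 0) {
                sizes.add(currentSplitSize);
                currentSplitSize = 0;
            }
            currentSplitSize += len;
        }
        // Whatever is left over lands at the tail.
        if (currentSplitSize != 0) {
            sizes.add(currentSplitSize);
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] files = {10, 10, 10, 10, 10, 10, 10, 10, 10, 10};
        System.out.println(splitSizes(files, 3)); // prints [30, 30, 30, 10]
    }
}
```

With ten 10-byte files and three requested splits, the target is 34 bytes, but
each closed split holds only 30. In this simplified model the accumulated
shortfall shows up at the tail of the listing; how the real input format
distributes it depends on details this sketch leaves out.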
The fix would be to change the following line:
currentSplitSize += srcFileStatus.getLen();
to
currentSplitSize += srcFileStatus.getLen() + (currentSplitSize - nBytesPerSplit);