[ https://issues.apache.org/jira/browse/MAPREDUCE-4933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13550524#comment-13550524 ]
Sandy Ryza commented on MAPREDUCE-4933:
---------------------------------------
If I understand how things work, yes. The length is used to calculate
onDiskBytes. If onDiskBytes is 0 (which can happen falsely without the
patch), the following code won't get called:
{code}
final int numInMemSegments = memDiskSegments.size();
diskSegments.addAll(0, memDiskSegments);
memDiskSegments.clear();
RawKeyValueIterator diskMerge = Merger.merge(
    job, fs, keyClass, valueClass, codec, diskSegments,
    ioSortFactor, numInMemSegments, tmpDir, comparator,
    reporter, false, spilledRecordsCounter, null);
diskSegments.clear();
if (0 == finalSegments.size()) {
  return diskMerge;
}
finalSegments.add(new Segment<K,V>(
    new RawKVIteratorReader(diskMerge, onDiskBytes), true));
{code}
which, if I understand correctly, means that some on-disk data won't be
incorporated in the final merge.
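
To make the accounting problem concrete, here is a minimal sketch in plain
Java. It is a simplified model, not the ReduceTask code: DiskFile, finalMerge
and the record strings are hypothetical names introduced for illustration.
It only assumes what is described above, namely that the disk merge branch
is skipped when the summed file lengths come out as 0.

```java
import java.util.ArrayList;
import java.util.List;

public class DiskMergeAccountingSketch {

    /** Hypothetical stand-in for an on-disk map output file. */
    public static final class DiskFile {
        final long reportedLength;   // length as reported by the file system
        final List<String> records;  // what the file actually contains

        public DiskFile(long reportedLength, List<String> records) {
            this.reportedLength = reportedLength;
            this.records = records;
        }
    }

    /**
     * Simplified final merge: disk records are only merged in when the
     * reported lengths add up to a non-zero byte count.
     */
    public static List<String> finalMerge(List<DiskFile> diskFiles,
                                          List<String> inMemoryRecords) {
        long onDiskBytes = 0;
        List<String> diskRecords = new ArrayList<>();
        for (DiskFile file : diskFiles) {
            onDiskBytes += file.reportedLength;
            diskRecords.addAll(file.records);
        }

        List<String> merged = new ArrayList<>(inMemoryRecords);
        if (onDiskBytes != 0) {
            // Only reached when the accounting says there is data on disk.
            merged.addAll(diskRecords);
        }
        return merged;
    }
}
```

In this model, a file that holds records but is reported with length 0
contributes nothing to the result, which is the kind of silent loss
described above.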
> MR1 final merge asks for length of file it just wrote before flushing it
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-4933
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4933
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv1, task
> Affects Versions: 1.1.1
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
> Attachments: MAPREDUCE-4933-branch-1.patch
>
>
> createKVIterator in ReduceTask contains the following code:
> {code}
> try {
>   Merger.writeFile(rIter, writer, reporter, job);
>   addToMapOutputFilesOnDisk(fs.getFileStatus(outputPath));
> } catch (Exception e) {
>   if (null != outputPath) {
>     fs.delete(outputPath, true);
>   }
>   throw new IOException("Final merge failed", e);
> } finally {
>   if (null != writer) {
>     writer.close();
>   }
> }
> {code}
> Merger#writeFile() does not close the file after writing it, so when
> fs.getFileStatus() is called on it, it may not return the correct length.
> This causes bad accounting further down the line, which can lead to map
> output data being lost.
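
The same effect can be reproduced outside Hadoop. The sketch below is a
local-file analogy using only java.io and java.nio, not the Hadoop FileSystem
API; StaleLengthDemo and lengthsAroundClose are hypothetical names. It
assumes a buffered writer whose buffer is larger than the payload, so nothing
reaches the file until the stream is closed.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StaleLengthDemo {

    /** Returns {length seen before close, length seen after close}. */
    public static long[] lengthsAroundClose(int payloadBytes) throws IOException {
        Path path = Files.createTempFile("final-merge", ".out");
        try {
            long before;
            // 64 KiB buffer: a smaller payload stays in memory until flush/close.
            try (OutputStream out =
                     new BufferedOutputStream(Files.newOutputStream(path), 64 * 1024)) {
                out.write(new byte[payloadBytes]);
                // Length queried while the writer is still open.
                before = Files.size(path);
            }
            // Length queried after close has flushed the buffer.
            long after = Files.size(path);
            return new long[] {before, after};
        } finally {
            Files.deleteIfExists(path);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] lengths = lengthsAroundClose(1024);
        System.out.println("before close: " + lengths[0]
            + ", after close: " + lengths[1]);
        // prints "before close: 0, after close: 1024"
    }
}
```

Querying the length only after the writer has been closed avoids the stale
value in this analogy.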