[
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235898#comment-14235898
]
Shrijeet Paliwal commented on MAPREDUCE-4815:
---------------------------------------------
I worked with [~l201514] to derive patch #8 (patch #7 is broken, do not use it)
and want to report that there may be a small bug left to resolve in #8 too. It
leaves a temp directory behind after the job finishes successfully. For
example, if your job writes to /foo/bar, it might leave /foo/bar_temporary
behind as a by-product. The temp directory is empty. I will submit an amended
patch to fix this.
Also, after this patch, any job that bypasses the context.write approach to
write its output will fail to work.
For example, if a mapper/reducer opens a file directly and writes to the job's
output directory, that output will be lost. The Pi estimator does the
following, and it fails after this patch.
{noformat}
@Override
public void cleanup(Context context) throws IOException {
  //write output to a file
  Configuration conf = context.getConfiguration();
  Path outDir = new Path(conf.get(FileOutputFormat.OUTDIR));
  Path outFile = new Path(outDir, "reduce-out");
  FileSystem fileSys = FileSystem.get(conf);
  SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, conf,
      outFile, LongWritable.class, LongWritable.class,
      CompressionType.NONE);
  writer.append(new LongWritable(numInside), new LongWritable(numOutside));
  writer.close();
}
{noformat}
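To make the alternative to a direct write concrete, here is a toy sketch of staged output using only java.nio. It is not Hadoop's FileOutputCommitter and not the mechanism of patch #8; the class name, helper names, and the bar_work directory are all hypothetical. The idea is that a side file goes into a per-task work directory, and a commit step later moves it into the final output directory and deletes the emptied work directory. In Hadoop the analogous location is the path returned by FileOutputFormat.getWorkOutputPath(context); whether patch #8 promotes files written there has not been verified here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Toy model of staged output. NOT Hadoop's committer; all names are hypothetical. */
public class StagedOutputSketch {

    /** A task writes a side file into its own work directory, not the final one. */
    static Path writeSideFile(Path workDir, String name, String contents) throws IOException {
        Files.createDirectories(workDir);
        Path file = workDir.resolve(name);
        Files.write(file, contents.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    /** Commit: move every entry from the work directory into the output directory. */
    static void commit(Path workDir, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        // Snapshot the listing first so we do not modify the directory while iterating.
        List<Path> entries = new ArrayList<>();
        try (Stream<Path> listing = Files.list(workDir)) {
            listing.forEach(entries::add);
        }
        for (Path entry : entries) {
            Files.move(entry, outDir.resolve(entry.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        // Remove the now-empty work directory so nothing is left behind.
        Files.delete(workDir);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("staged-output");
        Path outDir = root.resolve("bar");
        Path workDir = root.resolve("bar_work");

        writeSideFile(workDir, "reduce-out", "numInside,numOutside");
        commit(workDir, outDir);

        System.out.println("side file committed: " + Files.exists(outDir.resolve("reduce-out")));
        System.out.println("work dir left behind: " + Files.exists(workDir));
    }
}
```

The explicit delete at the end of commit is the step whose absence would leave an empty temp directory behind, which is the kind of leftover described above.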
> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4815
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
> Reporter: Jason Lowe
> Assignee: Siqi Li
> Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch,
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch,
> MAPREDUCE-4815.v8.patch
>
>
> If a job generates many files to commit then the commitJob method call at the
> end of the job can take minutes. This is a performance regression from 1.x,
> as 1.x had the tasks commit directly to the final output directory as they
> were completing and commitJob had very little to do. The commit work was
> processed in parallel and overlapped the processing of outstanding tasks. In
> 0.23/2.x, the commit is single-threaded and waits until all tasks have
> completed before commencing.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)