[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

Shrijeet Paliwal (JIRA) Fri, 05 Dec 2014 10:55:26 -0800

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14235898#comment-14235898
 ]


Shrijeet Paliwal commented on MAPREDUCE-4815:
---------------------------------------------

I worked with [~l201514] to derive patch #8 (patch #7 is broke, do not use it) 
& want to report there might be a small bug left to resolve in #8 too. It 
leaves a temp directory behind after finishing the job successfully. For 
example if your job write to /foo/bar it might be leaving /foor/bar_temporary 
as a by product. The temp directory is empty. I will submit an amend to fix 
this.

Also post this patch any job which bypasses contex.write approach to write to 
o/p, will fail to work. 

For example if in mapper/reducer ones opens a file directly & writes to job's 
o/p directory - that o/p will be lost. Pi estimator does following & it fails 
after this patch. 

{noformat}
  @Override
    public void cleanup(Context context) throws IOException {
      //write output to a file
      Configuration conf = context.getConfiguration();
      Path outDir = new Path(conf.get(FileOutputFormat.OUTDIR));
      Path outFile = new Path(outDir, "reduce-out");
      FileSystem fileSys = FileSystem.get(conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(fileSys, conf,
          outFile, LongWritable.class, LongWritable.class, 
          CompressionType.NONE);
      writer.append(new LongWritable(numInside), new LongWritable(numOutside));
      writer.close();
    }
{noformat}


> FileOutputCommitter.commitJob can be very slow for jobs with many output files
> ------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-4815
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4815
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.3, 2.0.1-alpha, 2.4.1
>            Reporter: Jason Lowe
>            Assignee: Siqi Li
>         Attachments: MAPREDUCE-4815.v3.patch, MAPREDUCE-4815.v4.patch, 
> MAPREDUCE-4815.v5.patch, MAPREDUCE-4815.v6.patch, MAPREDUCE-4815.v7.patch, 
> MAPREDUCE-4815.v8.patch
>
>
> If a job generates many files to commit then the commitJob method call at the 
> end of the job can take minutes.  This is a performance regression from 1.x, 
> as 1.x had the tasks commit directly to the final output directory as they 
> were completing and commitJob had very little to do.  The commit work was 
> processed in parallel and overlapped the processing of outstanding tasks.  In 
> 0.23/2.x, the commit is single-threaded and waits until all tasks have 
> completed before commencing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MAPREDUCE-4815) FileOutputCommitter.commitJob can be very slow for jobs with many output files

Reply via email to