[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated MAPREDUCE-6336:
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 3.0.0
     Release Note: 
mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.
  
In algorithm version 1:

  1. commitTask renames directory
  $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
  to
  $joboutput/_temporary/$appAttemptID/$taskID/

  2. recoverTask renames
  $joboutput/_temporary/$appAttemptID/$taskID/
  to
  $joboutput/_temporary/($appAttemptID + 1)/$taskID/

  3. commitJob merges every task output file in
  $joboutput/_temporary/$appAttemptID/$taskID/
  to
  $joboutput/, then it will delete $joboutput/_temporary/
  and write $joboutput/_SUCCESS

commitJob's run time, number of RPC, is O(n) in terms of output files, which is 
discussed in MAPREDUCE-4815, and can take minutes. 

Algorithm version 2 changes the behavior of commitTask, recoverTask, and 
commitJob.

  1. commitTask renames all files in
  $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
  to $joboutput/

  2. recoverTask is a nop strictly speaking, but for
  upgrade from version 1 to version 2 case, it checks if there
  are any files in
  $joboutput/_temporary/($appAttemptID - 1)/$taskID/
  and renames them to $joboutput/

  3. commitJob deletes $joboutput/_temporary and writes
  $joboutput/_SUCCESS

Algorithm 2 takes advantage of task parallelism and makes commitJob itself 
O(1). However, the window of vulnerability for having incomplete output in 
$jobOutput directory is much larger. Therefore, pipeline logic for consuming 
job outputs should be built on checking for existence of _SUCCESS marker.
     Hadoop Flags: Incompatible change,Reviewed
           Status: Resolved  (was: Patch Available)

Thanks, [~l201514] for contribution, and [~jlowe] for review! Committed to 
trunk.

> Enable v2 FileOutputCommitter by default
> ----------------------------------------
>
>                 Key: MAPREDUCE-6336
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6336
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>    Affects Versions: 2.7.0
>            Reporter: Gera Shegalov
>            Assignee: Siqi Li
>              Labels: BB2015-05-TBR
>             Fix For: 3.0.0
>
>         Attachments: MAPREDUCE-6336.v1.patch
>
>
> This JIRA is to propose making new FileOutputCommitter behavior from 
> MAPREDUCE-4815 enabled by default in trunk, and potentially in branch-2. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to