[ https://issues.apache.org/jira/browse/MAPREDUCE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gera Shegalov updated MAPREDUCE-6336: ------------------------------------- Resolution: Fixed Fix Version/s: 3.0.0 Release Note: mapreduce.fileoutputcommitter.algorithm.version now defaults to 2. In algorithm version 1: 1. commitTask renames directory $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/_temporary/$appAttemptID/$taskID/ 2. recoverTask renames $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/_temporary/($appAttemptID + 1)/$taskID/ 3. commitJob merges every task output file in $joboutput/_temporary/$appAttemptID/$taskID/ to $joboutput/, then it will delete $joboutput/_temporary/ and write $joboutput/_SUCCESS commitJob's run time, number of RPC, is O(n) in terms of output files, which is discussed in MAPREDUCE-4815, and can take minutes. Algorithm version 2 changes the behavior of commitTask, recoverTask, and commitJob. 1. commitTask renames all files in $joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/ to $joboutput/ 2. recoverTask is a nop strictly speaking, but for upgrade from version 1 to version 2 case, it checks if there are any files in $joboutput/_temporary/($appAttemptID - 1)/$taskID/ and renames them to $joboutput/ 3. commitJob deletes $joboutput/_temporary and writes $joboutput/_SUCCESS Algorithm 2 takes advantage of task parallelism and makes commitJob itself O(1). However, the window of vulnerability for having incomplete output in $jobOutput directory is much larger. Therefore, pipeline logic for consuming job outputs should be built on checking for existence of _SUCCESS marker. Hadoop Flags: Incompatible change,Reviewed Status: Resolved (was: Patch Available) Thanks, [~l201514] for contribution, and [~jlowe] for review! Committed to trunk. > Enable v2 FileOutputCommitter by default > ---------------------------------------- > > Key: MAPREDUCE-6336 > URL: https://issues.apache.org/jira/browse/MAPREDUCE-6336 > Project: Hadoop Map/Reduce > Issue Type: Improvement > Components: mrv2 > Affects Versions: 2.7.0 > Reporter: Gera Shegalov > Assignee: Siqi Li > Labels: BB2015-05-TBR > Fix For: 3.0.0 > > Attachments: MAPREDUCE-6336.v1.patch > > > This JIRA is to propose making new FileOutputCommitter behavior from > MAPREDUCE-4815 enabled by default in trunk, and potentially in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)