[
https://issues.apache.org/jira/browse/MAPREDUCE-6336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gera Shegalov updated MAPREDUCE-6336:
-------------------------------------
Resolution: Fixed
Fix Version/s: 3.0.0
Release Note:
mapreduce.fileoutputcommitter.algorithm.version now defaults to 2.
In algorithm version 1:
1. commitTask renames directory
$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
to
$joboutput/_temporary/$appAttemptID/$taskID/
2. recoverTask renames
$joboutput/_temporary/$appAttemptID/$taskID/
to
$joboutput/_temporary/($appAttemptID + 1)/$taskID/
3. commitJob merges every task output file in
$joboutput/_temporary/$appAttemptID/$taskID/
into
$joboutput/, then deletes $joboutput/_temporary/
and writes $joboutput/_SUCCESS
commitJob's run time and number of RPCs are O(n) in the number of output files,
as discussed in MAPREDUCE-4815, and the commit can take minutes.
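
The version 1 flow above can be sketched on a local filesystem as follows. This is
only an illustration, not Hadoop's implementation: the function names
(commit_task_v1, commit_job_v1, demo_v1), the task and attempt names, and the use
of local file moves in place of HDFS rename RPCs are all assumptions. It shows
that commitTask is a single directory rename, while commitJob walks every
committed file serially.

```python
import shutil
import tempfile
from pathlib import Path


def commit_task_v1(job_output, app_attempt, task_id, task_attempt):
    """Step 1: a single directory rename per committed task attempt."""
    attempt_root = job_output / "_temporary" / str(app_attempt)
    src = attempt_root / "_temporary" / task_attempt
    src.rename(attempt_root / task_id)


def commit_job_v1(job_output, app_attempt):
    """Step 3: serially merge every committed file into job_output.

    Returns the number of file moves, which grows with the number of files.
    """
    attempt_root = job_output / "_temporary" / str(app_attempt)
    moves = 0
    for task_dir in sorted(attempt_root.iterdir()):
        if task_dir.name == "_temporary":
            continue  # uncommitted task attempts are never merged
        for f in sorted(p for p in task_dir.rglob("*") if p.is_file()):
            dest = job_output / f.relative_to(task_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(f), str(dest))
            moves += 1
    shutil.rmtree(job_output / "_temporary")
    (job_output / "_SUCCESS").touch()
    return moves


def demo_v1(num_tasks=3):
    """Run a toy job (hypothetical names) and report what commitJob did."""
    with tempfile.TemporaryDirectory() as tmp:
        job_output = Path(tmp) / "out"
        for i in range(num_tasks):
            attempt = f"attempt_{i}_0"
            work = job_output / "_temporary" / "0" / "_temporary" / attempt
            work.mkdir(parents=True)
            (work / f"part-{i:05d}").write_text("data")
            commit_task_v1(job_output, 0, f"task_{i}", attempt)
        moves = commit_job_v1(job_output, 0)
        return moves, sorted(p.name for p in job_output.iterdir())
```

In this sketch the number of moves done inside commit_job_v1 equals the number of
output files, which is the O(n) cost described above.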
Algorithm version 2 changes the behavior of commitTask, recoverTask, and
commitJob.
1. commitTask renames all files in
$joboutput/_temporary/$appAttemptID/_temporary/$taskAttemptID/
to $joboutput/
2. recoverTask is strictly speaking a no-op, but to support upgrading
from version 1 to version 2 it checks whether there
are any files in
$joboutput/_temporary/($appAttemptID - 1)/$taskID/
and renames them to $joboutput/
3. commitJob deletes $joboutput/_temporary and writes
$joboutput/_SUCCESS
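
A comparable local-filesystem sketch of the version 2 flow is given here. As
before, this is an illustration under assumptions rather than Hadoop's code: the
function names (commit_task_v2, commit_job_v2, demo_v2) and the attempt names are
hypothetical, and local file moves stand in for HDFS renames. The per-file work
moves into commitTask, so commitJob only deletes the temporary tree and writes
the marker.

```python
import shutil
import tempfile
from pathlib import Path


def commit_task_v2(job_output, app_attempt, task_attempt):
    """Step 1: move the task attempt's files straight into job_output."""
    src = job_output / "_temporary" / str(app_attempt) / "_temporary" / task_attempt
    moved = 0
    for f in sorted(p for p in src.rglob("*") if p.is_file()):
        dest = job_output / f.relative_to(src)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(dest))
        moved += 1
    return moved


def commit_job_v2(job_output):
    """Step 3: no per-file work, only cleanup and the _SUCCESS marker."""
    shutil.rmtree(job_output / "_temporary", ignore_errors=True)
    (job_output / "_SUCCESS").touch()


def demo_v2(num_tasks=3):
    """Run a toy job and report what is visible before and after commitJob."""
    with tempfile.TemporaryDirectory() as tmp:
        job_output = Path(tmp) / "out"
        for i in range(num_tasks):
            attempt = f"attempt_{i}_0"
            work = job_output / "_temporary" / "0" / "_temporary" / attempt
            work.mkdir(parents=True)
            (work / f"part-{i:05d}").write_text("data")
            commit_task_v2(job_output, 0, attempt)
        # Task output is already visible here, before the job has committed.
        visible_before = sorted(p.name for p in job_output.iterdir() if p.is_file())
        marker_before = (job_output / "_SUCCESS").exists()
        commit_job_v2(job_output)
        marker_after = (job_output / "_SUCCESS").exists()
        return visible_before, marker_before, marker_after
```

The value returned by demo_v2 makes the trade-off visible: part files appear in
the output directory while the _SUCCESS marker is still absent.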
Algorithm version 2 takes advantage of task parallelism and makes commitJob itself
O(1). However, the window of vulnerability for having incomplete output in the
$joboutput directory is much larger. Therefore, pipeline logic that consumes
job outputs should check for the existence of the _SUCCESS marker.
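
A minimal sketch of such a consumer-side check, assuming the output directory is
reachable as a local path (a real pipeline would use its own filesystem client).
The helper names is_job_output_complete and list_committed_files are hypothetical.

```python
import tempfile
from pathlib import Path


def is_job_output_complete(job_output):
    """True only once commitJob has written the _SUCCESS marker."""
    return (job_output / "_SUCCESS").is_file()


def list_committed_files(job_output):
    """Return data files, refusing to read output that may be incomplete."""
    if not is_job_output_complete(job_output):
        raise RuntimeError(f"{job_output} has no _SUCCESS marker; job not committed")
    # Skip marker and hidden entries such as _SUCCESS or a leftover _temporary.
    return sorted(
        p for p in job_output.iterdir()
        if p.is_file() and not p.name.startswith(("_", "."))
    )
```

Raising instead of returning an empty list keeps a downstream step from silently
treating a partially committed directory as a finished, empty result.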
Hadoop Flags: Incompatible change,Reviewed
Status: Resolved (was: Patch Available)
Thanks to [~l201514] for the contribution and [~jlowe] for the review! Committed to
trunk.
> Enable v2 FileOutputCommitter by default
> ----------------------------------------
>
> Key: MAPREDUCE-6336
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6336
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mrv2
> Affects Versions: 2.7.0
> Reporter: Gera Shegalov
> Assignee: Siqi Li
> Labels: BB2015-05-TBR
> Fix For: 3.0.0
>
> Attachments: MAPREDUCE-6336.v1.patch
>
>
> This JIRA proposes enabling the new FileOutputCommitter behavior from
> MAPREDUCE-4815 by default in trunk, and potentially in branch-2.