Ivan Mitic created MAPREDUCE-6357:
-------------------------------------
Summary: MultipleOutputs.write() API should document that output
committing is not utilized when input path is absolute
Key: MAPREDUCE-6357
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6357
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: documentation
Affects Versions: 2.6.0
Reporter: Ivan Mitic
Assignee: Ivan Mitic
After spending the afternoon debugging a user job where reduce tasks were
failing on retry with the exception below, I think it would be worthwhile to
add a note to the MultipleOutputs.write() documentation saying that absolute
paths may cause tasks to fail on retry or when MR speculative execution is
enabled.
{code}
2015-04-28 23:13:10,452 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: File already exists:wasb://[email protected]/user/hadoop/some/path/block-r-00299.bz2
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.create(NativeAzureFileSystem.java:1354)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.create(NativeAzureFileSystem.java:1195)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:475)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:433)
at com.ancestry.bigtree.hadoop.LevelReducer.processValue(LevelReducer.java:91)
at com.ancestry.bigtree.hadoop.LevelReducer.reduce(LevelReducer.java:69)
at com.ancestry.bigtree.hadoop.LevelReducer.reduce(LevelReducer.java:14)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{code}
As discussed in MAPREDUCE-3772, when the baseOutputPath passed to
MultipleOutputs.write() is an absolute path (or more precisely a path that
resolves outside of the job output-dir), the concept of output committing is
not utilized.
In this case, the user had read through the MultipleOutputs docs and assumed
that everything would work fine, since there are blog posts saying that
MultipleOutputs does handle output commit.
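
Until the docs carry such a note, user code could guard against the problem
by checking the baseOutputPath before calling MultipleOutputs.write(). The
following is only a sketch of that idea, not part of Hadoop: the class and
method names are hypothetical, it uses plain java.nio paths instead of Hadoop
Path objects, it assumes POSIX-style paths, and it conservatively treats every
absolute path or scheme-qualified URI as unsafe.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Hadoop API): flags baseOutputPath values that
// would bypass output committing as described in this issue.
public class OutputPathCheck {

    // Returns true only if baseOutputPath is relative and, once resolved
    // against outputDir, still lies inside outputDir.
    static boolean isCommitSafe(String outputDir, String baseOutputPath) {
        // Assumption: anything with a scheme (e.g. wasb://, hdfs://) is absolute.
        if (baseOutputPath.contains("://")) {
            return false;
        }
        Path base = Paths.get(baseOutputPath);
        if (base.isAbsolute()) {
            return false;
        }
        Path root = Paths.get(outputDir).normalize();
        // normalize() collapses ".." segments so escapes are detected.
        return root.resolve(base).normalize().startsWith(root);
    }

    public static void main(String[] args) {
        String outputDir = "/user/hadoop/job-output"; // illustrative value
        // Relative path under the output dir.
        System.out.println(isCommitSafe(outputDir, "some/path/block"));
        // Absolute path, as in the failing job.
        System.out.println(isCommitSafe(outputDir, "/user/hadoop/some/path/block"));
        // Relative path that escapes the output dir via "..".
        System.out.println(isCommitSafe(outputDir, "../some/path/block"));
    }
}
```

A reducer could call such a check once per distinct base path and fail fast
with a clear message, rather than hitting "File already exists" on a retried
or speculative attempt.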