Ivan Mitic created MAPREDUCE-6357:
-------------------------------------

             Summary: MultipleOutputs.write() API should document that output 
committing is not utilized when input path is absolute
                 Key: MAPREDUCE-6357
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6357
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: documentation
    Affects Versions: 2.6.0
            Reporter: Ivan Mitic
            Assignee: Ivan Mitic


After spending an afternoon debugging a user job whose reduce tasks were 
failing on retry with the exception below, I think it would be worthwhile to 
add a note to the MultipleOutputs.write() documentation stating that absolute 
paths may cause tasks to execute improperly on retry or when MR speculative 
execution is enabled. 

{code}
2015-04-28 23:13:10,452 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: File already exists:wasb://[email protected]/user/hadoop/some/path/block-r-00299.bz2
       at org.apache.hadoop.fs.azure.NativeAzureFileSystem.create(NativeAzureFileSystem.java:1354)
       at org.apache.hadoop.fs.azure.NativeAzureFileSystem.create(NativeAzureFileSystem.java:1195)
       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
       at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
       at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
       at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:475)
       at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:433)
       at com.ancestry.bigtree.hadoop.LevelReducer.processValue(LevelReducer.java:91)
       at com.ancestry.bigtree.hadoop.LevelReducer.reduce(LevelReducer.java:69)
       at com.ancestry.bigtree.hadoop.LevelReducer.reduce(LevelReducer.java:14)
       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
       at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
       at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:415)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
       at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
{code}

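As discussed in MAPREDUCE-3772, when the baseOutputPath passed to 
MultipleOutputs.write() is an absolute path (or more precisely a path that 
resolves outside of the job output-dir), output committing is bypassed: the 
file is created directly at its final location, so a retried or speculative 
task attempt finds it already there and fails with "File already exists". 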
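
To make the distinction concrete, here is a rough sketch of a pre-flight check 
a job could run on its baseOutputPath values. It is not part of Hadoop: the 
class and method names are hypothetical, and it uses plain java.net.URI 
resolution as a stand-in for Hadoop's Path resolution. It is deliberately 
conservative and treats any absolute path or full URI as unsafe, on the 
assumption that only relative paths end up under the task attempt's work 
directory. 

```java
import java.net.URI;

// Hypothetical helper, not a Hadoop API: flags baseOutputPath values that
// would bypass output committing because they do not resolve under the job
// output directory.
final class BaseOutputPathCheck {

    private BaseOutputPathCheck() {
    }

    /**
     * Returns true only when baseOutputPath is relative and, once resolved
     * against jobOutputDir, still lies under jobOutputDir.
     */
    public static boolean isCommitSafe(String jobOutputDir, String baseOutputPath) {
        URI base = URI.create(baseOutputPath);

        // A scheme (wasb://, hdfs://, ...) or a leading slash makes the path
        // absolute, so it ignores whatever directory it is resolved against.
        if (base.isAbsolute() || baseOutputPath.startsWith("/")) {
            return false;
        }

        // A trailing slash makes URI resolution treat the output dir as a directory.
        String dir = jobOutputDir.endsWith("/") ? jobOutputDir : jobOutputDir + "/";
        URI root = URI.create(dir).normalize();

        // Resolve and collapse any ".." segments before comparing prefixes.
        URI resolved = root.resolve(base).normalize();
        return resolved.getPath().startsWith(root.getPath());
    }
}
```

With an output dir such as hdfs://namenode/user/hadoop/out (a made-up 
location), a value like "level1/block" passes the check, while 
"/user/hadoop/some/path/block" and "../other/block" do not. 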

In this case, the user had read through the MultipleOutputs docs and assumed 
that everything would work fine, since there are blog posts saying that 
MultipleOutputs does handle output commit. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
