[ https://issues.apache.org/jira/browse/HADOOP-8045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204582#comment-13204582 ]

Harsh J commented on HADOOP-8045:
---------------------------------

OK, that's a little odd. But the NN does exclude DNs based on their
transfer-thread load, and that is what is affecting you -- error at the DN or
not -- because of the 120 concurrent write requests per task (are you sure you
want that many small files?). You could also double your settings and see if
the problem eases or goes away.
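(For reference: on 0.20/1.0-era clusters the DN-side transfer-thread limit is
dfs.datanode.max.xcievers -- note the historical spelling -- set in
hdfs-site.xml. A sketch, with a purely illustrative value:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

Raising it requires a DN restart to take effect.)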

In any case, I'm +1 on adding a specific closing API to MultipleOutputs to
close a given named output.
Can you, however, add it to mapred.lib.MultipleOutputs (the Stable API) as well?

A few comments on the existing patch, btw:
* The javadoc can reside directly above the new function you've added.
Something like: "This method is useful in reducers where, after writing a
particular key's output, you may close that named output to save on
filesystem connections."
* Once closed, the writer must also be removed from the collection, so the
final close() doesn't touch it again (see the sketch after this list).
* The new addition requires test cases, as nothing covers this API call right
now. Please add a test that exercises your new method. There are existing
tests in TestMultipleOutputs (Stable API -- where you'd need to add the
method too) and TestMRMultipleOutputs (unstable, new API -- the one your
patch modifies).
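To make the second point concrete, here's a rough sketch of what such a
method might look like on the new-API class. The method name
closeNamedOutput, the recordWriters map, and the stored context field are
illustrative assumptions for this sketch, not necessarily what the attached
patch does:

    // Sketch only -- names here are illustrative, not from the patch.
    // Closes the RecordWriter of a single named output and removes it
    // from the writer cache, so the final close() won't close it twice
    // and the underlying DFS stream (and its DN transfer thread) is
    // released early.
    public void closeNamedOutput(String namedOutput)
        throws IOException, InterruptedException {
      RecordWriter<?, ?> writer = recordWriters.remove(namedOutput);
      if (writer != null) {
        writer.close(context);  // context: the task's attempt context
      }
    }

A reducer that writes one file per key would then call
mos.closeNamedOutput(...) at the end of each reduce() invocation, instead of
keeping all of its writers (and their DFS connections) open until cleanup.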

Thanks!
                
> org.apache.hadoop.mapreduce.lib.output.MultipleOutputs does not handle many 
> files well
> --------------------------------------------------------------------------------------
>
>                 Key: HADOOP-8045
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8045
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: io
>    Affects Versions: 0.21.0, 1.0.0
>            Environment: Cloudera CDH3 release.
>            Reporter: Tarjei Huse
>              Labels: patch
>         Attachments: hadoop-multiple-outputs.patch
>
>
> We were trying to use MultipleOutputs to write one file per key. This 
> produced the following exception:
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /user/me/part6/_temporary/_attempt_201202071305_0017_r_000000_2/2011-11-18-22-attempt_201202071305_0017_r_000000_2-r-00000
> could only be replicated to 0 nodes, instead of 1
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1520)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:665)
>     at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1434)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1430)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
> The error appeared once the number of files processed rose above 20 on a 
> single developer system. 
> The solution proved to be to close each RecordWriter once the reducer was 
> finished with a key, which required extending MultipleOutputs to fetch the 
> RecordWriter -- not a good solution. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
