[ 
https://issues.apache.org/jira/browse/HADOOP-3149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585536#action_12585536
 ] 

Alejandro Abdelnur commented on HADOOP-3149:
--------------------------------------------

My concern with the KeyValue class and how it would be used is that it is not an 
intuitive use of the OutputCollector. The expectation is that the 
collect(key, value) method takes a key and a value; with the KeyValue approach, 
the key becomes the named output and the value carries the actual key/value pair 
to output. IMO this is awkward.
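To make the contrast concrete, here is a minimal sketch of the two call styles being discussed. All names here (keyValueStyle, directStyle, the String types) are invented stand-ins for illustration, not the actual patch API:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class CollectStyles {
    // KeyValue style: the "key" argument actually names the output,
    // and the real key/value record is packed into the value argument.
    static String keyValueStyle(String outputName, Map.Entry<String, String> record) {
        return outputName + " <- " + record.getKey() + "=" + record.getValue();
    }

    // Direct style: the named output, key, and value are separate arguments.
    static String directStyle(String outputName, String key, String value) {
        return outputName + " <- " + key + "=" + value;
    }

    public static void main(String[] args) {
        System.out.println(keyValueStyle("new", new SimpleEntry<>("id42", "record")));
        System.out.println(directStyle("new", "id42", "record"));
    }
}
```

Both calls route the same record to the same output; the difference is that the first overloads the meaning of the key slot, which is what feels unintuitive.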

While MultipleOutputFormat provides a powerful mechanism to use the key/value 
to redirect the output in very clever ways, I still think there should be a 
simple API to just write key/values to different outputs. And, as you 
suggested, it could be based on MultipleOutputFormat.
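The kind of simple API I have in mind could look like the sketch below: named outputs are declared up front (as job configuration would do it), and then records are written to them by name. The class and method names here are hypothetical, not the patch's:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a "just write to a named output" API; the in-memory
// lists stand in for real record writers and are invented for illustration.
public class NamedOutputs {
    private final Map<String, List<String>> outputs = new LinkedHashMap<>();

    // Declare a named output up front, as the jobconf would.
    public void addNamedOutput(String name) {
        outputs.put(name, new ArrayList<>());
    }

    // Write a key/value to a previously declared named output.
    public void collect(String name, String key, String value) {
        List<String> out = outputs.get(name);
        if (out == null) {
            throw new IllegalArgumentException("Undeclared named output: " + name);
        }
        out.add(key + "\t" + value);
    }

    public List<String> records(String name) {
        return outputs.get(name);
    }

    public static void main(String[] args) {
        NamedOutputs mo = new NamedOutputs();
        mo.addNamedOutput("new");
        mo.addNamedOutput("update");
        mo.collect("new", "id1", "a");
        mo.collect("update", "id2", "b");
        System.out.println(mo.records("new")); // prints the records collected under "new"
    }
}
```

Failing fast on an undeclared output name keeps misconfiguration visible at write time rather than producing silently empty files.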

Regarding the finalPath generation and the 2 steps, I agree that the way things 
are done is confusing. Still, I think the 2 steps are needed: one to generate 
the filename without partition info, and a second to add the partition info (to 
avoid name collisions among tasks). Probably the current method names are not 
right, hence the confusion they generate.
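The two steps could be pictured roughly as follows; the method names here are invented for the sketch, not the patch's actual names, and the suffix format simply mirrors the 'new-m-00002' convention from the issue description:

```java
// Hypothetical illustration of the two-step name generation: step 1 derives
// a per-output leaf name with no partition info, step 2 appends the task
// type and partition so concurrent tasks do not collide.
public class OutputPathSteps {
    // Step 1: base name for the named output, no partition info yet.
    static String baseName(String namedOutput) {
        return namedOutput;
    }

    // Step 2: append task type and zero-padded partition number.
    static String partitionedName(String base, char taskType, int partition) {
        return String.format("%s-%c-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        String base = baseName("new");
        System.out.println(partitionedName(base, 'm', 2)); // new-m-00002
    }
}
```

Keeping the steps separate also suggests clearer method names: one that answers "what is this output called" and one that answers "where does this task write it".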

Also, in getRecordWriter(), the anonymous RecordWriter that is returned keeps 
the different record writers in a TreeMap keyed off the finalPath of the file; 
shouldn't it be keyed off the keyBasedPath? 
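For reference, the caching pattern in question looks roughly like this sketch: a TreeMap from a path string to a lazily created writer. Which path string to use as the map key is exactly the open question; here it is just a parameter, and the StringBuilder is an invented stand-in for a real RecordWriter:

```java
import java.util.TreeMap;

// Sketch of a record-writer cache: one writer per distinct cache key,
// created lazily on first use.
public class WriterCache {
    private final TreeMap<String, StringBuilder> writers = new TreeMap<>();

    // Look up (or lazily create) the writer for the given cache key.
    StringBuilder writerFor(String cacheKey) {
        return writers.computeIfAbsent(cacheKey, k -> new StringBuilder());
    }

    int size() {
        return writers.size();
    }

    public static void main(String[] args) {
        WriterCache cache = new WriterCache();
        cache.writerFor("new-m-00002").append("id1\ta\n");
        cache.writerFor("new-m-00002").append("id2\tb\n");
        System.out.println(cache.size()); // 1: the same key reuses the writer
    }
}
```

If the map is keyed off a path that varies per record when it should not, each lookup would open a fresh writer instead of reusing the existing one, which is why the choice of key matters.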

> supporting multiple outputs for M/R jobs
> ----------------------------------------
>
>                 Key: HADOOP-3149
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3149
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Assignee: Alejandro Abdelnur
>             Fix For: 0.17.0
>
>         Attachments: patch3149.txt, patch3149.txt, patch3149.txt, 
> patch3149.txt
>
>
> The outputcollector supports writing data to a single output, the 'part' 
> files in the output path.
> We found it quite common that our M/R jobs have to write data to different 
> outputs. For example when classifying data as NEW, UPDATE, DELETE, NO-CHANGE 
> to later do different processing on it.
> Handling the initialization of additional outputs from within the M/R code 
> complicates the code and is counterintuitive with the notion of job 
> configuration.
> It would be desirable to:
> # Configure the additional outputs in the jobconf, potentially specifying 
> different outputformats, key and value classes for each one.
> # Write to the additional outputs in a similar way as data is written to the 
> outputcollector.
> # Support the speculative execution semantics for the output files, only 
> visible in the final output for promoted tasks.
> To support multiple outputs the following classes would be added to 
> mapred/lib:
> * {{MOJobConf}} : extends {{JobConf}} adding methods to define named outputs 
> (name, outputformat, key class, value class)
> * {{MOOutputCollector}} : extends {{OutputCollector}} adding a 
> {{collect(String outputName, WritableComparable key, Writable value)}} method.
> * {{MOMapper}} and {{MOReducer}}: implement {{Mapper}} and {{Reducer}} adding 
> a new {{configure}}, {{map}} and {{reduce}} signature that take the 
> corresponding {{MO}} classes and performs the proper initialization.
> The data flow behavior would be: key/values written to the default (unnamed) 
> output (using the original OutputCollector {{collect}} signature) take part 
> of the shuffle/sort/reduce processing phases. key/values written to a named 
> output from within a map don't.
> The named output files would be named using the task type and task ID to 
> avoid collision among tasks (i.e. 'new-m-00002' and 'new-r-00001').
> Together with the setInputPathFilter feature introduced by HADOOP-2055 it 
> would become very easy to chain jobs working on particular named outputs 
> within a single directory.
> We are using this pattern heavily and it greatly simplified our M/R code as 
> well as chaining different M/R jobs. 
> We wanted to contribute this back to Hadoop as we think it is a generic 
> feature many could benefit from.
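Based only on the signatures quoted above, usage of the proposed MOOutputCollector could look like this sketch. The class body, storage, and String types are invented stand-ins; only the idea of overloading collect with an output-name argument comes from the description:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the proposed MOOutputCollector: the original collect(key, value)
// feeds the shuffle/sort path, while collect(outputName, key, value) writes
// straight to a named output and bypasses shuffle, as the description states.
public class MOCollectorSketch {
    final List<String> defaultOutput = new ArrayList<>(); // goes to shuffle/sort
    final List<String> namedOutput = new ArrayList<>();   // bypasses shuffle

    // Original OutputCollector signature: default (unnamed) output.
    void collect(String key, String value) {
        defaultOutput.add(key + "\t" + value);
    }

    // Proposed MO signature: write directly to a named output.
    void collect(String outputName, String key, String value) {
        namedOutput.add(outputName + ":" + key + "\t" + value);
    }

    public static void main(String[] args) {
        MOCollectorSketch c = new MOCollectorSketch();
        c.collect("id1", "record");        // participates in shuffle/sort/reduce
        c.collect("new", "id2", "record"); // written to the 'new-*' task files
        System.out.println(c.defaultOutput.size() + " " + c.namedOutput.size());
    }
}
```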

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
