[jira] [Commented] (TEZ-3215) Support for MultipleOutputs

Ming Ma (JIRA) Tue, 19 Apr 2016 20:25:38 -0700

    [ https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249184#comment-15249184 ]


Ming Ma commented on TEZ-3215:
------------------------------

Maybe the usage could be something like this: once a mapper finishes processing 
its input, it knows all the possible Outputs the reducers will need. The 
mappers then notify the AM of all the unique category names so that the AM can 
dynamically add Outputs at runtime. On the reducer side, the processor queries 
the context for the specific Output name based on the category key it receives. 
It does seem unnecessarily complicated.

I completely agree with you that having one Output write to multiple HDFS files 
is the best approach. I brought up the dynamic output approach just to discuss 
alternative solutions before writing code.
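
To make the "one Output, multiple files" idea concrete, here is a rough sketch 
of the lazy per-category writer map. It is plain Java against the local file 
system, not the Tez or MR API: the class name MultiFileOutput, the write() 
method, and the ".txt" naming are all made up for illustration, and a real 
version inside MROutput would create Hadoop RecordWriters on HDFS instead. It 
also assumes category names are safe to use as file names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: one logical output that lazily opens one file per
 * category key, so the set of keys does not need to be known ahead of time.
 */
class MultiFileOutput implements AutoCloseable {
    private final Path baseDir;

    // Plays the role of the recordWriters map: category name -> open writer.
    private final Map<String, BufferedWriter> recordWriters = new HashMap<>();

    MultiFileOutput(Path baseDir) throws IOException {
        // Create the output directory if it does not exist yet.
        this.baseDir = Files.createDirectories(baseDir);
    }

    /** Writes one record to the file for its category, opening it on first use. */
    void write(String category, String record) throws IOException {
        BufferedWriter writer = recordWriters.get(category);
        if (writer == null) {
            // First record for this category: open its file and remember the writer.
            writer = Files.newBufferedWriter(baseDir.resolve(category + ".txt"));
            recordWriters.put(category, writer);
        }
        writer.write(record);
        writer.newLine();
    }

    /** Closes every writer that was opened. */
    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : recordWriters.values()) {
            writer.close();
        }
        recordWriters.clear();
    }
}
```

The reducer would call write(category, record) for each record and close the 
output once at the end, so no AM round trip is needed to discover the 
categories.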

Thanks!

> Support for MultipleOutputs
> ---------------------------
>
>                 Key: TEZ-3215
>                 URL: https://issues.apache.org/jira/browse/TEZ-3215
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> Here is the use case. A reducer might write its output to more than one file. 
> The file name will be based on the mapper key. We don't know all possible 
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't 
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work as we don't know 
> all possible keys.
> * Use MR MultipleOutputs directly from the Tez application processor. It 
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat. But 
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve 
> this. If this is a solved problem, can anyone explain how to do it?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
