[
https://issues.apache.org/jira/browse/TEZ-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243978#comment-15243978
]
Hitesh Shah commented on TEZ-3215:
----------------------------------
The current approach has been to configure the vertex in question with multiple
sinks. The main question is whether the processor thinks there is one Output
object being written to or multiple Output objects.
From the looks of the description, it seems the processor doesn't know all
options upfront, so it will likely need to write through one Output object,
and internally that output can multiplex to the appropriate files as needed.
This should be doable by creating, say, a MultiMROutput class which can either
re-use code from MROutput or extend it as needed. The MROutputCommitter in use
today with MROutput may need a corresponding MultiMROutputCommitter or other
changes, depending on the approach taken for how the files are being written.
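
To make the multiplexing idea concrete, here is a minimal sketch of a single
output object that opens one writer per file name on first use. It is plain
Java on the local filesystem and does not use any Tez or MR API: the class
name MultiFileWriter, its write(fileName, key, value) signature, and the
tab-separated record format are all assumptions made for illustration. A real
MultiMROutput would presumably hold RecordWriter instances created from the
configured OutputFormat instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a "MultiMROutput": one output object, many files. */
public class MultiFileWriter implements AutoCloseable {
    private final Path outputDir;
    // One lazily created writer per file name, so keys need not be known upfront.
    private final Map<String, BufferedWriter> recordWriters = new HashMap<>();

    public MultiFileWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Writes one record to the named file, opening that file on first use. */
    public void write(String fileName, String key, String value) throws IOException {
        BufferedWriter writer = recordWriters.get(fileName);
        if (writer == null) {
            writer = Files.newBufferedWriter(
                    outputDir.resolve(fileName), StandardCharsets.UTF_8);
            recordWriters.put(fileName, writer);
        }
        writer.write(key + "\t" + value);
        writer.newLine();
    }

    /** Closes every writer that was opened; a sketch, so no partial-failure handling. */
    @Override
    public void close() throws IOException {
        for (BufferedWriter writer : recordWriters.values()) {
            writer.close();
        }
        recordWriters.clear();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("multi-output");
        try (MultiFileWriter out = new MultiFileWriter(dir)) {
            // The file name is derived from the key at write time.
            out.write("apple", "apple", "1");
            out.write("banana", "banana", "2");
            out.write("apple", "apple", "3");
        }
        System.out.println(Files.readAllLines(dir.resolve("apple")));
    }
}
```

Because the set of files is only known once the task has finished writing, a
committer for such an output would have to discover the files at commit time
rather than rely on a single known output path, which is why the committer may
need to change as well.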
> Support for MultipleOutputs
> ---------------------------
>
> Key: TEZ-3215
> URL: https://issues.apache.org/jira/browse/TEZ-3215
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Ming Ma
>
> Here is the use case. A reducer might write its output to more than one file.
> The file name will be based on the mapper key. We don't know all possible
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work, as we don't know
> all possible keys.
> * Using MR MultipleOutputs directly from the Tez application processor: it
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat, but
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve
> this. If this is a solved problem, can anyone explain how to do it?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)