Siddharth Seth commented on TEZ-3215:

A couple of minor comments.
- Missing @Override annotation on flush in MROutputs
- newRecordWriter / oldRecordWriter will be set up when MROutput.initialize is 
called. I think this is avoidable.
- Could be called MultiMROutput - similar to MultiMRInput (which deals with 
multiple readers). Up to you if you want to change this.
Are any changes required to the associated OutputCommitter?
Otherwise, looks good to me.

> Support for MultipleOutputs
> ---------------------------
>                 Key: TEZ-3215
>                 URL: https://issues.apache.org/jira/browse/TEZ-3215
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: TEZ-3215-2.patch, TEZ-3215-3.patch, TEZ-3215-4.patch, 
> TEZ-3215-5.patch, TEZ-3215.patch
> Here is the use case. A reducer might write its output to more than one file. 
> The file name will be based on the mapper key. We don't know all possible 
> keys ahead of time. In MR, MultipleOutputs provides such support. I couldn't 
> find anything readily available in Tez.
> * Setting up one DataSink per file ahead of time won't work as we don't know all 
> possible keys.
> * Use MR MultipleOutputs directly from the Tez application processor. It 
> isn't clear how to pass TaskInputOutputContext to MultipleOutputs.
> * Tez MROutput can create a DataSink based on the specified outputFormat. But 
> it can't take MR MultipleOutputs.
> I ended up modifying Tez MROutput with a HashMap {{recordWriters}} to achieve 
> this. If this is a solved problem, can anyone explain how to do it?
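
For illustration, the {{recordWriters}} idea from the description can be sketched roughly as follows. This is not the code from the attached patches: the class name KeyedWriters, the writerFactory parameter, and the use of java.io.Writer in place of a Hadoop RecordWriter are all assumptions made to keep the example self-contained. Writers are created on first use, so no writer has to be set up before a key is actually seen.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * Hypothetical stand-in for an output that keeps one writer per output name.
 * java.io.Writer is used here instead of a Hadoop RecordWriter (assumption).
 */
class KeyedWriters implements AutoCloseable {
    // One writer per output name, created lazily on first write.
    private final Map<String, Writer> recordWriters = new HashMap<>();
    // Hypothetical factory that opens a writer for a given output name.
    private final Function<String, Writer> writerFactory;

    KeyedWriters(Function<String, Writer> writerFactory) {
        this.writerFactory = writerFactory;
    }

    /** Writes one tab-separated record to the writer for outputName. */
    void write(String outputName, String key, String value) {
        Writer writer = recordWriters.computeIfAbsent(outputName, writerFactory);
        try {
            writer.write(key + "\t" + value + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of writers opened so far. */
    int openWriterCount() {
        return recordWriters.size();
    }

    /** Closes every writer that was opened. */
    @Override
    public void close() {
        for (Writer writer : recordWriters.values()) {
            try {
                writer.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

In a real Tez output the factory would presumably derive a file path from the output name and open a record writer through the configured output format; how the OutputCommitter then handles those files is the open question raised in the comment above.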

This message was sent by Atlassian JIRA
