[
https://issues.apache.org/jira/browse/MAPREDUCE-370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amareshwari Sriramadasu updated MAPREDUCE-370:
----------------------------------------------
Attachment: patch-370.txt
Attaching an early patch.
The patch does the following:
1. Adds an API to org.apache.hadoop.mapreduce.lib.output.FileOutputFormat that
returns a RecordWriter for a given file name. The current API does not support
passing a file name.
2. Adds org.apache.hadoop.mapreduce.lib.output.MultipleOutputs with the following
API:
{code}
public class MultipleOutputs<KEYOUT, VALUEOUT> {

  public MultipleOutputs(TaskInputOutputContext context);

  // Adds a named output for the job.
  public static void addNamedOutput(Job job, String namedOutput,
      Class<? extends FileOutputFormat> outputFormatClass,
      Class<?> keyClass, Class<?> valueClass);

  // Enables counters for named outputs.
  public static void setCountersEnabled(Job job, boolean enabled);

  // Writes to a named output. The output file name depends on the key,
  // value, context and namedOutput. The record writer comes from the
  // output format added for the named output.
  public <K, V> void write(String namedOutput, K key, V value)
      throws IOException, InterruptedException;

  // Writes to an output file whose name depends on the key, value and
  // context. The record writer comes from the job's output format, which
  // should be a FileOutputFormat.
  public void write(KEYOUT key, VALUEOUT value)
      throws IOException, InterruptedException;

  protected <K, V> String generateOutputName(K key, V value,
      TaskAttemptContext context, String name);

  protected <K, V> K generateActualKey(K key, V value);

  protected <K, V> V generateActualValue(K key, V value);
}
{code}
The user can add named outputs, each with its OutputFormat and output key/value
types, using addNamedOutput.
The user can override generateOutputName to choose the final output name, which
gives the user complete control over it. Once the user supplies this name, the
framework itself can generate the unique file name, as is done in the patch.
This lets the existing counter feature count the number of records written to
each output name. The same method can be used to plug in the functionality of
multiNamedOutputs.
The added test case illustrates how to use the API.
3. Deprecates org.apache.hadoop.mapred.lib.Multiple*Output*
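To make the generateOutputName hook from item 2 more concrete, here is a minimal, Hadoop-free sketch of the idea. It is not the code in the patch: the classes OutputNameSketch and SketchOutputs, the in-memory "files", and the "-r-00000" suffix (loosely modeled on Hadoop's part-r-00000 naming) are assumptions made only for illustration. The sketch shows a user override choosing the output name, the framework side appending a unique suffix, and a per-name record counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free model of the proposed hook; not the patch's code.
public class OutputNameSketch {

    // Stand-in for the proposed MultipleOutputs: routes records to in-memory "files".
    static class SketchOutputs<K, V> {
        private final Map<String, List<String>> files = new TreeMap<>();
        private final Map<String, Long> counters = new TreeMap<>();
        private final int partition; // stands in for the task's partition number

        SketchOutputs(int partition) {
            this.partition = partition;
        }

        // Overridable hook: returns the user-visible output name for a record.
        // The default simply uses the named output as given.
        protected String generateOutputName(K key, V value, String name) {
            return name;
        }

        // Framework side: turns the user-chosen name into a per-task unique file name.
        private String uniqueFileName(String outputName) {
            return String.format(Locale.ROOT, "%s-r-%05d", outputName, partition);
        }

        public void write(String namedOutput, K key, V value) {
            String outputName = generateOutputName(key, value, namedOutput);
            // Count records per output name, before the unique suffix is added.
            counters.merge(outputName, 1L, Long::sum);
            files.computeIfAbsent(uniqueFileName(outputName), f -> new ArrayList<>())
                 .add(key + "\t" + value);
        }

        public Map<String, Long> getCounters() {
            return counters;
        }

        public Map<String, List<String>> getFiles() {
            return files;
        }
    }

    // Writes three records through an override that splits output by the key's first letter.
    static SketchOutputs<String, Integer> demo() {
        SketchOutputs<String, Integer> out = new SketchOutputs<String, Integer>(0) {
            @Override
            protected String generateOutputName(String key, Integer value, String name) {
                return name + "_" + key.charAt(0);
            }
        };
        out.write("text", "apple", 1);
        out.write("text", "avocado", 2);
        out.write("text", "banana", 3);
        return out;
    }

    public static void main(String[] args) {
        SketchOutputs<String, Integer> out = demo();
        System.out.println(out.getCounters());
        System.out.println(out.getFiles().keySet());
    }
}
```

In this sketch the counters are keyed by the name the user returned (text_a and text_b), while the file names carry the framework-generated suffix, which mirrors the split of responsibilities described above.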
> Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-370
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-370
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Reporter: Amareshwari Sriramadasu
> Assignee: Amareshwari Sriramadasu
> Attachments: patch-370.txt
>
>