[jira] Updated: (MAPREDUCE-370) Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.

Amareshwari Sriramadasu (JIRA) Fri, 07 Aug 2009 04:18:40 -0700

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Amareshwari Sriramadasu updated MAPREDUCE-370:
----------------------------------------------

    Attachment: patch-370.txt

Attaching an early patch.

The patch does the following:
1. Adds an API to org.apache.hadoop.mapreduce.lib.output.FileOutputFormat that 
returns a RecordWriter for a given file name. The current API does not support 
passing a file name.

2. Adds org.apache.hadoop.mapreduce.lib.output.MultipleOutputs with the 
following API:
{code}
public class MultipleOutputs<KEYOUT, VALUEOUT> {

  public MultipleOutputs(TaskInputOutputContext context);

  // Adds a named output for the job.
  public static void addNamedOutput(Job job, String namedOutput,
      Class<? extends FileOutputFormat> outputFormatClass,
      Class<?> keyClass, Class<?> valueClass);

  // Enables counters for named outputs.
  public static void setCountersEnabled(Job job, boolean enabled);

  // Writes to a named output.
  // The output file name depends on the key, value, context and named output.
  // Gets the record writer from the output format added for the named output.
  public <K, V> void write(String namedOutput, K key, V value)
      throws IOException, InterruptedException;

  // Writes to an output file name that depends on the key, value and context.
  // Gets the record writer from the job's output format.
  // The job's output format should be a FileOutputFormat.
  public void write(KEYOUT key, VALUEOUT value)
      throws IOException, InterruptedException;

  protected <K, V> String generateOutputName(K key, V value,
      TaskAttemptContext context, String name);

  protected <K, V> K generateActualKey(K key, V value);
  protected <K, V> V generateActualValue(K key, V value);
}
{code}

Users can register named outputs, together with the corresponding OutputFormat 
and output key/value types, using addNamedOutput. 
Users can override generateOutputName to supply the final output name, which 
gives them complete control over it. Once the user supplies this name, a 
unique file name can be generated from it (this can be done in the framework 
itself, as the patch does). This also lets the existing counter feature count 
the number of records written to each output name. The same method can be used 
to plug in the functionality of multiNamedOutputs.
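
To make the naming hook concrete, here is a minimal, Hadoop-free sketch of the 
idea. It is not the code in the patch: MultipleOutputsSketch, 
ByFirstLetterOutputs, uniqueFileName, recordsWritten and the "-r-NNNNN" suffix 
are hypothetical names and formats chosen for illustration, the 
TaskAttemptContext argument is dropped, and a plain map stands in for the 
counters.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the proposed MultipleOutputs class. It models
// only the naming hook and the per-output record counts, not real writers.
class MultipleOutputsSketch<KEYOUT, VALUEOUT> {
  // Records written per user-visible output name (stands in for counters).
  private final Map<String, Long> counters = new LinkedHashMap<>();
  private final int taskId;

  MultipleOutputsSketch(int taskId) {
    this.taskId = taskId;
  }

  // Hook the user overrides. The default keeps the named output unchanged.
  protected <K, V> String generateOutputName(K key, V value, String name) {
    return name;
  }

  // Framework side: derive a per-task unique file name from the user's name.
  // The suffix format is an assumption made for this sketch.
  String uniqueFileName(String outputName) {
    return String.format(Locale.ROOT, "%s-r-%05d", outputName, taskId);
  }

  // Counts the record under its output name and returns the file name the
  // record would be written to.
  public <K, V> String write(String namedOutput, K key, V value) {
    String outputName = generateOutputName(key, value, namedOutput);
    counters.merge(outputName, 1L, Long::sum);
    return uniqueFileName(outputName);
  }

  long recordsWritten(String outputName) {
    return counters.getOrDefault(outputName, 0L);
  }
}

// Example override: split a named output by the first character of the key.
// Assumes the key's string form is non-empty.
class ByFirstLetterOutputs extends MultipleOutputsSketch<String, Integer> {
  ByFirstLetterOutputs(int taskId) {
    super(taskId);
  }

  @Override
  protected <K, V> String generateOutputName(K key, V value, String name) {
    return name + "_" + key.toString().charAt(0);
  }
}
```

In this sketch, writing keys "apple" and "avocado" to the named output "text" 
from task 3 counts two records under "text_a", and both map to the single file 
name "text_a-r-00003". The user controls the name, and uniqueness is added 
afterwards on the framework side.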

The added test case illustrates how to use the API.

3. Deprecates org.apache.hadoop.mapred.lib.Multiple*Output*



> Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-370
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-370
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Amareshwari Sriramadasu
>            Assignee: Amareshwari Sriramadasu
>         Attachments: patch-370.txt
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
