[jira] Commented: (PIG-652) Need to give user control of OutputFormat

Alan Gates (JIRA) Thu, 05 Feb 2009 17:38:25 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670977#action_12670977
 ]


Alan Gates commented on PIG-652:
--------------------------------

I propose that we add a method to the StoreFunc interface:

{code}
    /**
     * Specify a backend-specific class to use to prepare for
     * storing output.  In the Hadoop case, this can return an
     * OutputFormat that will be used instead of PigOutputFormat.  The
     * framework will call this function, and if the returned Class
     * implements OutputFormat, it will be used.
     * @return Backend-specific class used to prepare for storing output.
     * @throws IOException if the class does not implement the expected
     * interface(s).
     */
    public Class getStorePreparationClass() throws IOException;
{code}

This way we are not forced to write whole Pig copies of the OutputFormat and 
RecordWriter interfaces (the way Slicer and Slice copy InputFormat, InputSplit, 
and RecordReader) while still avoiding importing Hadoop classes into our 
interface.  It also avoids forcing the StoreFunc to also be a RecordWriter (the 
way LoadFunc has to implement Slicer).
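
As a rough sketch of how the framework side of this might look: the backend 
asks the store function for its preparation class and uses it only if it 
implements OutputFormat, otherwise falling back to PigOutputFormat.  The 
OutputFormat and PigOutputFormat types below are empty stand-ins so the sketch 
compiles without Hadoop or Pig; TableOutputFormat, TableStoreFunc, 
DefaultStoreFunc, and OutputFormatResolver are hypothetical names, the StoreFunc 
shown carries only the proposed method, and treating a null return as "use the 
default" is an assumption, not part of the proposal.

```java
import java.io.IOException;

// Stand-in for Hadoop's OutputFormat (assumption: the real one lives in Hadoop).
interface OutputFormat {}

// Stand-in for Pig's default output format.
class PigOutputFormat implements OutputFormat {}

// Hypothetical custom format that a store function might ask for.
class TableOutputFormat implements OutputFormat {}

// Reduced to the one proposed method; note it mentions no Hadoop types.
interface StoreFunc {
    Class<?> getStorePreparationClass() throws IOException;
}

// Hypothetical store function that requests its own OutputFormat.
class TableStoreFunc implements StoreFunc {
    public Class<?> getStorePreparationClass() {
        return TableOutputFormat.class;
    }
}

// Hypothetical store function with no preference (null is an assumed convention).
class DefaultStoreFunc implements StoreFunc {
    public Class<?> getStorePreparationClass() {
        return null;
    }
}

// Hypothetical backend-side helper: only the Hadoop backend knows about OutputFormat.
class OutputFormatResolver {
    static Class<? extends OutputFormat> resolve(StoreFunc func) throws IOException {
        Class<?> prep = func.getStorePreparationClass();
        // Use the returned class only if it implements OutputFormat.
        if (prep != null && OutputFormat.class.isAssignableFrom(prep)) {
            return prep.asSubclass(OutputFormat.class);
        }
        return PigOutputFormat.class;
    }
}
```

The interface itself stays free of backend imports; the type check happens 
entirely in backend code.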

The downside is that we would not change Pig Latin to allow a construct like:

{code}
store A using MyStoreFunc() using format MyOutputFormat()
{code}

There would be an advantage to this.  For example, if you wanted to 
serialize tuples over a socket, you might still want to use PigStorage but 
create a SocketOutputFormat class.  With the currently proposed interface you 
could still accomplish this by writing a StoreFunc that subclasses PigStorage 
and implements getStorePreparationClass(), but this is less elegant.  
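
A minimal sketch of that subclassing workaround, under stated assumptions: 
PigStorage here is a stand-in reduced to the proposed method (the real class 
has the full load/store logic), OutputFormat is an empty stand-in for the 
Hadoop interface, and SocketPigStorage is a hypothetical name.

```java
import java.io.IOException;

// Stand-in for Hadoop's OutputFormat so the sketch compiles on its own.
interface OutputFormat {}

// Hypothetical format that would write records to a socket.
class SocketOutputFormat implements OutputFormat {}

// Stand-in for Pig's PigStorage, reduced to the proposed method.
class PigStorage {
    public Class<?> getStorePreparationClass() throws IOException {
        return null; // assumed convention: no preference, use the default
    }
}

// Inherits PigStorage's serialization and changes only the output format.
class SocketPigStorage extends PigStorage {
    @Override
    public Class<?> getStorePreparationClass() {
        return SocketOutputFormat.class;
    }
}
```

A script would then use "store A using SocketPigStorage();", at the cost of 
one extra class per StoreFunc and OutputFormat combination.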

As far as I know, no one is currently asking for the ability to specify 
OutputFormat separately from StoreFunc, and doing so would necessitate creating 
Pig copies of OutputFormat and RecordWriter.  So rather than create a lot of 
extra interfaces for functionality no one is requesting, I propose this simpler 
solution.  If in the future we choose to allow the two to be specified 
separately, we would still want a StoreFunc to be able to specify its 
OutputFormat, so the proposed functionality would not be deprecated.

> Need to give user control of OutputFormat
> -----------------------------------------
>
>                 Key: PIG-652
>                 URL: https://issues.apache.org/jira/browse/PIG-652
>             Project: Pig
>          Issue Type: New Feature
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> Pig currently allows users some control over InputFormat via the Slicer and 
> Slice interfaces.  It does not allow any control over OutputFormat and 
> RecordWriter interfaces.  It just allows the user to implement a storage 
> function that controls how the data is serialized.  For hadoop tables, we 
> will need to allow custom OutputFormats that prepare output information and 
> objects needed by a Table store function.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
