[jira] Resolved: (PIG-160) Change LoadFunc interface to work with new types

Alan Gates (JIRA) Tue, 20 Jan 2009 14:49:26 -0800

     [ 
https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alan Gates resolved PIG-160.
----------------------------

    Resolution: Fixed

Done as part of the types work, which is now in trunk.

> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
>                 Key: PIG-160
>                 URL: https://issues.apache.org/jira/browse/PIG-160
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>         Attachments: loadfuncs_v1.patch
>
>
> The LoadFunc interface needs to change to support new types.  The load 
> function will need to support two new features:
> 1) type conversion, how to get the bytes read from the source converted to 
> java Integer, Float, String, etc.
> 2) schema discovery, as we want to support self-describing data such as JSON, 
> and we will need the load function to tell us that schema.
> The proposed new interface is:
> {code:title=LoadFunc.java|borderStyle=solid}
> /**
>  * This interface is used to implement functions to parse records
>  * from a dataset.  This also includes functions to cast raw byte data into
>  * various datatypes.  These are external functions because we want loaders,
>  * whenever possible, to delay casting of datatypes until the last possible
>  * moment (i.e. don't do it on load).  This means we need to expose the
>  * functionality so that other sections of the code can call back to the
>  * loader to do the cast.
>  */
> public interface LoadFunc {
>     /**
>      * Specifies a portion of an InputStream to read tuples. Because the
>      * starting and ending offsets may not be on record boundaries it is up to
>      * the implementor to deal with figuring out the actual starting and
>      * ending offsets in such a way that an arbitrarily sliced up file will
>      * be processed in its entirety.
>      * <p>
>      * A common way of handling slices in the middle of records is to start at
>      * the given offset and, if the offset is not zero, skip to the end of the
>      * first record (which may be a partial record) before reading tuples.
>      * Reading continues until a tuple has been read that ends at an offset
>      * past the ending offset.
>      * <p>
>      * <b>The load function should not do any buffering on the input
>      * stream</b>. Buffering will cause the offsets returned by is.getPos()
>      * to be unreliable.
>      *  
>      * @param fileName the name of the file to be read
>      * @param is the stream representing the file to be processed, and which
>      * can also provide its position.
>      * @param offset the offset to start reading tuples.
>      * @param end the ending offset for reading.
>      * @throws IOException
>      */
>     public void bindTo(String fileName,
>                        BufferedPositionedInputStream is,
>                        long offset,
>                        long end) throws IOException;
>     /**
>      * Retrieves the next tuple to be processed.
>      * @return the next tuple to be processed or null if there are no more
>      * tuples to be processed.
>      * @throws IOException
>      */
>     public Tuple getNext() throws IOException;
>     
>     /**
>      * Cast data from bytes to boolean value.  
>      * @param b byte array to be cast.
>      * @return Boolean value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Boolean bytesToBoolean(byte[] b) throws IOException;
>     
>     /**
>      * Cast data from bytes to integer value.  
>      * @param b byte array to be cast.
>      * @return Integer value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Integer bytesToInteger(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to long value.  
>      * @param b byte array to be cast.
>      * @return Long value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Long bytesToLong(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to float value.  
>      * @param b byte array to be cast.
>      * @return Float value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Float bytesToFloat(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to double value.  
>      * @param b byte array to be cast.
>      * @return Double value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Double bytesToDouble(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to chararray value.  
>      * @param b byte array to be cast.
>      * @return String value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public String bytesToCharArray(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to map value.  
>      * @param b byte array to be cast.
>      * @return Map value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to tuple value.  
>      * @param b byte array to be cast.
>      * @return Tuple value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Tuple bytesToTuple(byte[] b) throws IOException;
>     /**
>      * Cast data from bytes to bag value.  
>      * @param b byte array to be cast.
>      * @return Bag value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public DataBag bytesToBag(byte[] b) throws IOException;
>     /**
>      * Indicate to the loader the fields that will be needed.  This can be
>      * useful for loaders that access data that is stored in a columnar
>      * format, where indicating the columns to be accessed ahead of time will
>      * save scans.  If the loader function cannot make use of this
>      * information, it is free to ignore it.
>      * @param schema Schema indicating which columns will be needed.
>      */
>     public void fieldsToRead(Schema schema);
>     /**
>      * Find the schema from the loader.  This function will be called at
>      * parse time (not run time) to see if the loader can provide a schema
>      * for the data.  The loader may be able to do this if the data is
>      * self-describing (e.g. JSON).  If the loader cannot determine the
>      * schema, it can return null.
>      * @param fileName Name of the file to be read.
>      * @param in input stream, so that the function can read enough of the
>      * data to determine the schema.
>      * @param end Function should not read past this position in the stream.
>      * @return a Schema describing the data if possible, or null otherwise.
>      * @throws IOException
>      */
>     public Schema determineSchema(String fileName,
>                                   BufferedPositionedInputStream in,
>                                   long end) throws IOException;
> }
> {code} 
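
The slice-handling convention described in the bindTo comment can be illustrated with a minimal, self-contained sketch. It is not Pig code: SliceLineReader and readSlice are hypothetical names, records are assumed to be newline-delimited ASCII, and the reader tracks its own position instead of relying on BufferedPositionedInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a loader that reads newline-delimited records from one slice. */
class SliceLineReader {
    private InputStream in;
    private long pos;  // offset of the next unread byte
    private long end;  // ending offset of this slice

    /** Mirrors bindTo: seek to offset and, if not at the start of the file, skip the first (possibly partial) record. */
    void bindTo(byte[] file, long offset, long end) throws IOException {
        this.in = new ByteArrayInputStream(file);
        this.pos = in.skip(offset);
        this.end = end;
        if (offset != 0) {
            readLine(); // discarded here; the reader of the previous slice returns this record
        }
    }

    /** Mirrors getNext: returns the next record, or null once a record has ended past the ending offset. */
    String getNext() throws IOException {
        if (pos > end) {
            return null;
        }
        return readLine();
    }

    /** Reads one byte at a time (no buffering) so that pos always reflects the true offset. */
    private String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean readAny = false;
        int b;
        while ((b = in.read()) != -1) {
            pos++;
            readAny = true;
            if (b == '\n') {
                break;
            }
            sb.append((char) b); // ASCII assumed for this sketch
        }
        return readAny ? sb.toString() : null;
    }

    /** Convenience helper (hypothetical): collects every record that belongs to the slice [offset, end). */
    static List<String> readSlice(byte[] file, long offset, long end) throws IOException {
        SliceLineReader reader = new SliceLineReader();
        reader.bindTo(file, offset, end);
        List<String> records = new ArrayList<>();
        for (String rec = reader.getNext(); rec != null; rec = reader.getNext()) {
            records.add(rec);
        }
        return records;
    }
}
```

Under these assumptions, calling readSlice once per slice and concatenating the results yields every record exactly once, wherever the boundary between two non-empty slices falls. A record that starts exactly on a boundary is read by the earlier slice and skipped by the later one.
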
> This bug also covers the work to convert existing load functions (e.g. 
> PigStorage, BinStorage) to the new interface.
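
The delayed casting that the interface comment describes might look roughly like the sketch below for a text-based loader. This is an illustration under assumptions, not the converted PigStorage code: Utf8TextCaster is a hypothetical class, each field is assumed to be stored as UTF-8 text, and only three of the bytesTo* methods are shown. A failed cast is reported as an IOException, as the proposed Javadoc specifies.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical caster for a loader whose fields are stored as UTF-8 text. */
class Utf8TextCaster {
    /** Parses the bytes as a decimal integer; an IOException signals that the value cannot be cast. */
    public Integer bytesToInteger(byte[] b) throws IOException {
        String s = text(b);
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("Cannot cast to integer: " + s, e);
        }
    }

    /** Parses the bytes as a floating-point number; an IOException signals that the value cannot be cast. */
    public Double bytesToDouble(byte[] b) throws IOException {
        String s = text(b);
        try {
            return Double.valueOf(s);
        } catch (NumberFormatException e) {
            throw new IOException("Cannot cast to double: " + s, e);
        }
    }

    /** Decodes the bytes as a string without trimming, so the field's text is preserved as stored. */
    public String bytesToCharArray(byte[] b) throws IOException {
        return new String(b, StandardCharsets.UTF_8);
    }

    /** Decodes and trims the bytes before numeric parsing (trimming is an assumption of this sketch). */
    private static String text(byte[] b) {
        return new String(b, StandardCharsets.UTF_8).trim();
    }
}
```

With a caster like this, a field can stay a raw byte[] after load and be converted only when an operator actually needs a typed value, for example via new Utf8TextCaster().bytesToInteger(rawField), where rawField is whatever byte array the loader produced for that field.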

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
