[ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580897#action_12580897 ]

Alan Gates commented on PIG-160:
--------------------------------

I think the goal should be for pig to work with 4 different types of 
metadata:

1) None.  We already handle this case.
2) User supplied, as suggested in the parser changes in PIG-159.
3) Self-describing data, such as JSON.
4) External metadata services.

By the end of these changes, pig will be able to handle cases 1-3.  Case 4 is 
somewhere in the distant future.

I envision it working something like this:

1) A pig script is parsed and a logical plan constructed.
2) determineSchema is called on the appropriate LoadFunc to see if it can 
determine the schema.  The vast majority of loader functions will just return 
null here, but loader functions for self-describing data will need to read 
enough of the data to determine the schema and return it.
3) If the user provided a schema, that will be assigned in the type checking 
phase.
4) In cases where no metadata is provided, but users treat a field as if it is 
of a certain type (e.g. pass it to SUM or concat it), then as part of type 
checking an on-the-fly conversion of that field to the most general type that 
can satisfy the expression (double for numerics) will need to be inserted.

If a user provides a schema for self-describing data (i.e. we have multiple 
schema sources) I don't know how we handle that.  Do we believe the user, 
believe the data, or issue an error and quit?
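
To make that flow concrete, here's a rough sketch of how the front end might 
string the pieces together at parse time.  It is purely illustrative: 
SchemaResolver and resolveSchema are made-up names, and the conflict branch 
just picks one of the options mentioned above.

{code}
import java.io.IOException;

// Illustrative only: SchemaResolver is not part of this proposal.  It just
// shows the order in which the pieces above would be consulted.
public class SchemaResolver {

    public Schema resolveSchema(LoadFunc loader,
                                String fileName,
                                BufferedPositionedInputStream in,
                                long end,
                                Schema userSchema) throws IOException {
        // Ask the loader first; most loaders will simply return null here.
        Schema dataSchema = loader.determineSchema(fileName, in, end);

        if (dataSchema == null) {
            // No self-describing data: fall back to the user supplied schema,
            // which may itself be null (no metadata at all).
            return userSchema;
        }
        if (userSchema == null) {
            // Self-describing data and no user schema: believe the data.
            return dataSchema;
        }
        // Both present: the open question above.  This sketch believes the
        // data and warns; erroring out would be just as defensible.
        System.err.println("User schema conflicts with schema found in "
                + fileName + "; using the data's schema.");
        return dataSchema;
    }
}
{code}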

Another note is that we want to do actual type conversion as late as possible.  
Given the following script:

a = load 'myfile' as (userid: chararray, total_purchases: float);
b = filter a by total_purchases > 1000.00;
store b into 'output';

We don't want the load function to convert userid and total_purchases at load 
time, because we hope to filter out a large number of the records and because 
we never actually need to convert userid.  However, the loader function is the 
only place that knows how to convert the raw bytes to specific types.  So I've 
added conversion functions to the LoadFunc interface.  But it will be the job 
of the post-parse checkers to insert casts at the appropriate points in the 
plan (in this case in the filter) so that the data is the correct type when 
needed.
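
To show what such an inserted cast might look like, here's a sketch (mine, not 
actual plan code; the CastToFloat class is hypothetical).  The cast node the 
checker inserts above the filter's comparison holds a reference back to the 
loader and converts only the field the expression actually touches, at the 
moment it is evaluated:

{code}
import java.io.IOException;

// Hypothetical cast node that a post-parse checker could insert above the
// filter's comparison.  Only total_purchases is ever converted, and only when
// the filter evaluates it; userid is never converted at all.
public class CastToFloat {

    private final LoadFunc loader;   // the loader that produced the raw bytes

    public CastToFloat(LoadFunc loader) {
        this.loader = loader;
    }

    public Float eval(byte[] rawField) throws IOException {
        // Delegate to the loader, since only it knows how its on-disk byte
        // representation maps to a Java Float.
        return loader.bytesToFloat(rawField);
    }
}
{code}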

> Change LoadFunc interface to work with new types
> ------------------------------------------------
>
>                 Key: PIG-160
>                 URL: https://issues.apache.org/jira/browse/PIG-160
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>
> The LoadFunc interface needs to change to support new types.  The load
> function will need to support two new features:
> 1) type conversion, how to get the bytes read from the source converted to
> Java Integer, Float, String, etc.
> 2) schema discovery, as we want to support self-describing data such as
> JSON, and we will need the load function to tell us that schema.
> The proposed new interface is:
> {code:title=LoadFunc.java|borderStyle=solid}
> /**
>  * This interface is used to implement functions to parse records from a
>  * dataset.  This also includes functions to cast raw byte data into
>  * various datatypes.  These are external functions because we want
>  * loaders, whenever possible, to delay casting of datatypes until the
>  * last possible moment (i.e. don't do it on load).  This means we need
>  * to expose the functionality so that other sections of the code can
>  * call back to the loader to do the cast.
>  */
> public interface LoadFunc {
>     /**
>      * Specifies a portion of an InputStream to read tuples.  Because the
>      * starting and ending offsets may not be on record boundaries it is up
>      * to the implementor to figure out the actual starting and ending
>      * offsets in such a way that an arbitrarily sliced up file will be
>      * processed in its entirety.
>      * <p>
>      * A common way of handling slices in the middle of records is to start
>      * at the given offset and, if the offset is not zero, skip to the end
>      * of the first record (which may be a partial record) before reading
>      * tuples.  Reading continues until a tuple has been read that ends at
>      * an offset past the ending offset.
>      * <p>
>      * <b>The load function should not do any buffering on the input
>      * stream</b>.  Buffering will cause the offsets returned by is.getPos()
>      * to be unreliable.
>      *
>      * @param fileName the name of the file to be read.
>      * @param is the stream representing the file to be processed, which can
>      * also provide its position.
>      * @param offset the offset at which to start reading tuples.
>      * @param end the ending offset for reading.
>      * @throws IOException
>      */
>     public void bindTo(String fileName,
>                        BufferedPositionedInputStream is,
>                        long offset,
>                        long end) throws IOException;
>     /**
>      * Retrieves the next tuple to be processed.
>      * @return the next tuple to be processed, or null if there are no more
>      * tuples to be processed.
>      * @throws IOException
>      */
>     public Tuple getNext() throws IOException;
>
>     /**
>      * Cast data from bytes to boolean value.
>      * @param b byte array to be cast.
>      * @return Boolean value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Boolean bytesToBoolean(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to integer value.
>      * @param b byte array to be cast.
>      * @return Integer value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Integer bytesToInteger(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to long value.
>      * @param b byte array to be cast.
>      * @return Long value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Long bytesToLong(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to float value.
>      * @param b byte array to be cast.
>      * @return Float value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Float bytesToFloat(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to double value.
>      * @param b byte array to be cast.
>      * @return Double value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Double bytesToDouble(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to chararray value.
>      * @param b byte array to be cast.
>      * @return String value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public String bytesToCharArray(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to map value.
>      * @param b byte array to be cast.
>      * @return Map value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Map<Object, Object> bytesToMap(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to tuple value.
>      * @param b byte array to be cast.
>      * @return Tuple value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public Tuple bytesToTuple(byte[] b) throws IOException;
>
>     /**
>      * Cast data from bytes to bag value.
>      * @param b byte array to be cast.
>      * @return Bag value.
>      * @throws IOException if the value cannot be cast.
>      */
>     public DataBag bytesToBag(byte[] b) throws IOException;
>
>     /**
>      * Indicate to the loader which fields will be needed.  This can be
>      * useful for loaders that access data stored in a columnar format,
>      * where indicating the columns to be accessed ahead of time will save
>      * scans.  If the loader function cannot make use of this information,
>      * it is free to ignore it.
>      * @param schema Schema indicating which columns will be needed.
>      */
>     public void fieldsToRead(Schema schema);
>
>     /**
>      * Find the schema from the loader.  This function will be called at
>      * parse time (not run time) to see if the loader can provide a schema
>      * for the data.  The loader may be able to do this if the data is
>      * self-describing (e.g. JSON).  If the loader cannot determine the
>      * schema, it can return null.
>      * @param fileName name of the file to be read.
>      * @param in input stream, so that the function can read enough of the
>      * data to determine the schema.
>      * @param end the function should not read past this position in the
>      * stream.
>      * @return a Schema describing the data if possible, or null otherwise.
>      * @throws IOException
>      */
>     public Schema determineSchema(String fileName,
>                                   BufferedPositionedInputStream in,
>                                   long end) throws IOException;
> }
> {code} 
> This bug also covers the work to convert existing load functions (e.g.
> PigStorage, BinStorage) to the new interface.
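
As a closing illustration (mine, not part of the issue text): a minimal sketch 
of how a delimited-text loader in the spirit of PigStorage might implement one 
of the new conversion methods, assuming fields are stored as UTF-8 text.

{code}
import java.io.IOException;

// Sketch only: a PigStorage-like loader that stores fields as UTF-8 text could
// implement bytesToInteger along these lines.  A real loader must match
// whatever byte representation it actually writes.
public class TextConversions {

    public Integer bytesToInteger(byte[] b) throws IOException {
        if (b == null || b.length == 0) {
            return null;   // treat missing data as null rather than an error
        }
        try {
            return Integer.valueOf(new String(b, "UTF-8").trim());
        } catch (NumberFormatException e) {
            throw new IOException("cannot cast bytes to Integer: "
                    + e.getMessage());
        }
    }
}
{code}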
