[ https://issues.apache.org/jira/browse/PIG-160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alan Gates resolved PIG-160. ---------------------------- Resolution: Fixed Done as part of the types work, which is now in trunk. > Change LoadFunc interface to work with new types > ------------------------------------------------ > > Key: PIG-160 > URL: https://issues.apache.org/jira/browse/PIG-160 > Project: Pig > Issue Type: Sub-task > Components: impl > Reporter: Alan Gates > Assignee: Alan Gates > Attachments: loadfuncs_v1.patch > > > The LoadFunc interface needs to change to support new types. The load > function will need to support two new features: > 1) type conversion, how to get the bytes read from the source converted to > java Integer, Float, String, etc. > 2) schema discovery, as we want to support self-describing data such JSON, > and we will need the load function to tell us that schema. > The proposed new interface is: > {code:title=Bar.java|borderStyle=solid} > /** > * This interface is used to implement functions to parse records > * from a dataset. This also includes functions to cast raw byte data into > various > * datatypes. These are external functions because we want loaders, whenever > * possible, to delay casting of datatypes until the last possible moment > (i.e. > * don't do it on load). This means we need to expose the functionality so > that > * other sections of the code can call back to the loader to do the cast. > */ > public interface LoadFunc { > /** > * Specifies a portion of an InputStream to read tuples. Because the > * starting and ending offsets may not be on record boundaries it is up to > * the implementor to deal with figuring out the actual starting and > ending > * offsets in such a way that an arbitrarily sliced up file will be > processed > * in its entirety. > * <p> > * A common way of handling slices in the middle of records is to start at > * the given offset and, if the offset is not zero, skip to the end of the > * first record (which may be a partial record) before reading tuples. > * Reading continues until a tuple has been read that ends at an offset > past > * the ending offset. > * <p> > * <b>The load function should not do any buffering on the input > stream</b>. Buffering will > * cause the offsets returned by is.getPos() to be unreliable. > * > * @param fileName the name of the file to be read > * @param is the stream representing the file to be processed, and which > can also provide its position. > * @param offset the offset to start reading tuples. > * @param end the ending offset for reading. > * @throws IOException > */ > public void bindTo(String fileName, > BufferedPositionedInputStream is, > long offset, > long end) throws IOException; > /** > * Retrieves the next tuple to be processed. > * @return the next tuple to be processed or null if there are no more > tuples > * to be processed. > * @throws IOException > */ > public Tuple getNext() throws IOException; > > /** > * Cast data from bytes to boolean value. > * @param bytes byte array to be cast. > * @return Boolean value. > * @throws IOException if the value cannot be cast. > */ > public Boolean bytesToBoolean(byte[] b) throws IOException; > > /** > * Cast data from bytes to integer value. > * @param bytes byte array to be cast. > * @return Integer value. > * @throws IOException if the value cannot be cast. > */ > public Integer bytesToInteger(byte[] b) throws IOException; > /** > * Cast data from bytes to long value. > * @param bytes byte array to be cast. > * @return Long value. > * @throws IOException if the value cannot be cast. > */ > public Long bytesToLong(byte[] b) throws IOException; > /** > * Cast data from bytes to float value. > * @param bytes byte array to be cast. > * @return Float value. > * @throws IOException if the value cannot be cast. > */ > public Float bytesToFloat(byte[] b) throws IOException; > /** > * Cast data from bytes to double value. > * @param bytes byte array to be cast. > * @return Double value. > * @throws IOException if the value cannot be cast. > */ > public Double bytesToDouble(byte[] b) throws IOException; > /** > * Cast data from bytes to chararray value. > * @param bytes byte array to be cast. > * @return String value. > * @throws IOException if the value cannot be cast. > */ > public String bytesToCharArray(byte[] b) throws IOException; > /** > * Cast data from bytes to map value. > * @param bytes byte array to be cast. > * @return Map value. > * @throws IOException if the value cannot be cast. > */ > public Map<Object, Object> bytesToMap(byte[] b) throws IOException; > /** > * Cast data from bytes to tuple value. > * @param bytes byte array to be cast. > * @return Tuple value. > * @throws IOException if the value cannot be cast. > */ > public Tuple bytesToTuple(byte[] b) throws IOException; > /** > * Cast data from bytes to bag value. > * @param bytes byte array to be cast. > * @return Bag value. > * @throws IOException if the value cannot be cast. > */ > public DataBag bytesToBag(byte[] b) throws IOException; > /** > * Indicate to the loader fields that will be needed. This can be useful > for > * loaders that access data that is stored in a columnar format where > indicating > * columns to be accessed a head of time will save scans. If the loader > * function cannot make use of this information, it is free to ignore it. > * @param schema Schema indicating which columns will be needed. > */ > public void fieldsToRead(Schema schema); > /** > * Find the schema from the loader. This function will be called at > parse time > * (not run time) to see if the loader can provide a schema for the data. > The > * loader may be able to do this if the data is self describing (e.g. > JSON). If > * the loader cannot determine the schema, it can return a null. > * @param fileName Name of the file to be read. > * @param in inpu stream, so that the function can read enough of the > * data to determine the schema. > * @param end Function should not read past this position in the stream. > * @return a Schema describing the data if possible, or null otherwise. > * @throws IOException. > */ > public Schema determineSchema(String fileName, > BufferedPositionedInputStream in, > long end) throws IOException; > } > {code} > This bug also covers the work to convert existing load function (eg > PigStorage, BinStorage) to the new interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.