Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "LoadStoreRedesignProposal" page has been changed by AlanGates. http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=38&rev2=39 -------------------------------------------------- !PigStorage.putNext(). While it will not be possible to use PigStorage directly every effort should be made to share this code (most likely by putting the actual code in static utility methods that can be called by each class) to avoid double code maintenance costs. + It has been suggested that we should switch to the typed bytes protocol that is available in Hadoop and Hive (see + https://issues.apache.org/jira/browse/PIG-966?focusedCommentId=12781695&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781695 ). While we cannot switch the default, we can make this streaming + connection an interface so that users can easily extend it in the future. The interface should be quite simple: + + {{{ + interface PigToStream { + + /** + * Given a tuple, produce an array of bytes to be passed to the streaming + * executable. + */ + public byte[] serialize(Tuple t) throws IOException; + + /** + * Set the record delimiter to use when communicating with the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + + interface StreamToPig { + + /** + * Given a byte array from a streaming executable, produce a tuple. + */ + public Tuple deserialize(byte[]) throws IOException; + + /** + * Set the record delimiter to use when reading from the streaming + * executable. The default if this is not set is \n. + */ + public void setRecordDelimiter(byte delimiter); + } + }}} + + The default implementation of this would be as suggested above. The syntax for describing how data is (de)serialized would then stay as it currently is, except instead of giving a + !StoreFunc the user would specify a !PigToStream, and instead of specifying a !LoadFunc a !StreamToPig. + + Additionally, it has been noted that this change takes away the current optimization of Pig Latin scripts such as the following: + + {{{ + A = load 'myfile' split by 'file'; + B = stream A through 'mycmd'; + store B into 'outfile'; + }}} + + In this case Pig will optimize the query by removing the load function and replacing it with !BinaryStorage, a function which simply passes the data as is to the streaming + executable. It does not record or field parsing. Similarly, the store in the above script would be replaced with !BinaryStorage. + + We have two options to replace this. First, we could say that if a class implementing !PigToStream also implements !InputFormat, then Pig will drop the Load statement and use that + !InputFormat directly to load data and then pass the results to the stream. The same would be done with !StreamToPig, !OutputFormat and store. Second, we could create + !IdentityLoader and !IdentityStreamToPig functions. !IdentityLoader.getNext would return a tuple that just had one bytearray, which would be the entire record. This would then be a + trivial serialization via the default !PigToStream. Similarly !IdentityStreamToPig would take the bytes returned by the stream and put them in a tuple of a single bytearray. The + store function would then naturally translate this tuple into the underlying bytes. + Functionally these are basically equivalent, since Pig would need to write code similar to the !IdentityLoader etc. for the second case. So I believe the primary difference is in + how it is presented to the user not the functionality or code written underneath. + + Both of these approaches suffer from the problem that they assume !TextInputFormat and !TextOutputFormat. For any other IF/OF it will not be clear how to parse key, value + pairs out of the stream data. + + This optimization represents a fair amount of work. As the current optimization is not documented, it is not clear how many users are using it. Based on that I vote that we + do not implement this optimization until such time as we see a need for it. === Remaining Tasks === * !BinStorage needs to implement !LoadMetadata's getSchema() to replace current determineSchema() @@ -709, +771 @@ Nov 23 2009, Dmitriy Ryaboy * updated StoreMetadata to match changes made to LoadMetadata + Nov 25 2009, Gates + * Updated section on streaming to suggest creating an interface for streaming (de)serializers rather than having only one hardwired option. Also added some thoughts on possible replacements for the current !BinaryStorage/split by file optimization. +