Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreRedesignProposal" page has been changed by AlanGates.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=38&rev2=39

--------------------------------------------------

  !PigStorage.putNext().  While it will not be possible to use !PigStorage directly, every effort should be made to share this code (most likely by putting the actual code in static
  utility methods that can be called by each class) to avoid the cost of maintaining duplicate code.
  
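+ A minimal sketch of that shared-utility idea (the class and method names here are hypothetical, not part of the proposal): both !PigStorage and the streaming serializer would call the same static helper instead of duplicating the field-formatting logic.
+ 
+ {{{
+     import java.io.IOException;
+     import java.io.OutputStream;
+ 
+     // Hypothetical shared helper called from both PigStorage.putNext() and the
+     // streaming serializer.  A real implementation would also handle tuples,
+     // bags, and maps the way PigStorage does.
+     public class StorageUtil {
+ 
+         /** Write one field in a simple text form; null fields are written as empty. */
+         public static void putField(OutputStream out, Object field) throws IOException {
+             if (field != null) {
+                 out.write(field.toString().getBytes("UTF-8"));
+             }
+         }
+     }
+ }}}
+ 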
+ It has been suggested that we should switch to the typed bytes protocol that is available in Hadoop and Hive (see
+     https://issues.apache.org/jira/browse/PIG-966?focusedCommentId=12781695&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781695
+ ).  While we cannot make typed bytes the default, we can make this streaming connection an interface so that users can easily extend it in the future.  The interface should be quite simple:
+ 
+ {{{
+     interface PigToStream {
+ 
+         /**
+          * Given a tuple, produce an array of bytes to be passed to the streaming
+          * executable.
+          */
+         public byte[] serialize(Tuple t) throws IOException;
+ 
+         /**
+          * Set the record delimiter to use when communicating with the streaming
+          * executable.  The default if this is not set is \n.
+          */
+         public void setRecordDelimiter(byte delimiter);
+     }
+ 
+     interface StreamToPig {
+ 
+         /**
+          * Given a byte array from a streaming executable, produce a tuple.
+          */
+         public Tuple deserialize(byte[] bytes) throws IOException;
+ 
+         /**
+          * Set the record delimiter to use when reading from the streaming
+          * executable.  The default if this is not set is \n.
+          */
+         public void setRecordDelimiter(byte delimiter);
+     }
+ }}}
+ 
+ The default implementation of this would be as suggested above.  The syntax for describing how data is (de)serialized would then stay as it currently is, except that instead of
+ giving a !StoreFunc the user would specify a !PigToStream, and instead of a !LoadFunc a !StreamToPig.
+ 
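+ As a rough sketch of what the default might look like, the class below implements both interfaces with tab-delimited text, roughly mirroring !PigStorage.  The class name and the field handling are illustrative assumptions, not part of the proposal:
+ 
+ {{{
+     import java.io.IOException;
+     import org.apache.pig.data.Tuple;
+     import org.apache.pig.data.TupleFactory;
+ 
+     // Illustrative default (de)serializer: fields are tab-separated text and
+     // records end with the configured delimiter (\n unless changed).
+     public class TextStreaming implements PigToStream, StreamToPig {
+ 
+         private byte recordDel = '\n';
+ 
+         public void setRecordDelimiter(byte delimiter) {
+             recordDel = delimiter;
+         }
+ 
+         public byte[] serialize(Tuple t) throws IOException {
+             StringBuilder sb = new StringBuilder();
+             for (int i = 0; i < t.size(); i++) {
+                 if (i > 0) sb.append('\t');
+                 Object field = t.get(i);
+                 if (field != null) sb.append(field.toString());
+             }
+             // This sketch assumes the serializer writes the record delimiter;
+             // that detail is not settled by the interface above.
+             sb.append((char) recordDel);
+             return sb.toString().getBytes("UTF-8");
+         }
+ 
+         public Tuple deserialize(byte[] bytes) throws IOException {
+             // Treat every field as a chararray; a real implementation would
+             // reuse PigStorage's parsing to handle types and complex fields.
+             String line = new String(bytes, "UTF-8");
+             String[] fields = line.split("\t", -1);
+             Tuple t = TupleFactory.getInstance().newTuple(fields.length);
+             for (int i = 0; i < fields.length; i++) {
+                 t.set(i, fields[i]);
+             }
+             return t;
+         }
+     }
+ }}}
+ 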
+ Additionally, it has been noted that this change takes away the current optimization of Pig Latin scripts such as the following:
+ 
+ {{{
+ A = load 'myfile' split by 'file';
+ B = stream A through 'mycmd';
+ store B into 'outfile';
+ }}}
+ 
+ In this case Pig will optimize the query by removing the load function and replacing it with !BinaryStorage, a function which simply passes the data as-is to the streaming
+ executable.  It does no record or field parsing.  Similarly, the store in the above script would be replaced with !BinaryStorage.
+ 
+ We have two options to replace this.  First, we could say that if a class implementing !PigToStream also implements !InputFormat, then Pig will drop the load statement and use that
+ !InputFormat directly to load data and then pass the results to the stream.  The same would be done with !StreamToPig, !OutputFormat, and store.  Second, we could create
+ !IdentityLoader and !IdentityStreamToPig functions.  !IdentityLoader.getNext would return a tuple containing just one bytearray, which would be the entire record.  This would then be a
+ trivial serialization via the default !PigToStream.  Similarly, !IdentityStreamToPig would take the bytes returned by the stream and put them in a tuple of a single bytearray.  The
+ store function would then naturally translate this tuple into the underlying bytes.
+ 
+ Functionally these are basically equivalent, since Pig would need to write code similar to the !IdentityLoader etc. for the second case.  So I believe the primary difference is in
+ how it is presented to the user, not in the functionality or code written underneath.
+ 
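+ As a rough illustration of the second option, an !IdentityStreamToPig might look like the sketch below (!IdentityLoader.getNext would build the same kind of single-bytearray tuple on the load side).  The details here are assumptions for illustration only:
+ 
+ {{{
+     import java.io.IOException;
+     import org.apache.pig.data.DataByteArray;
+     import org.apache.pig.data.Tuple;
+     import org.apache.pig.data.TupleFactory;
+ 
+     // Hypothetical identity deserializer: bytes coming back from the streaming
+     // executable are not parsed at all, just wrapped in a one-field tuple.
+     public class IdentityStreamToPig implements StreamToPig {
+ 
+         public void setRecordDelimiter(byte delimiter) {
+             // Records are still split on the delimiter before reaching
+             // deserialize(), so there is nothing to remember in this sketch.
+         }
+ 
+         public Tuple deserialize(byte[] bytes) throws IOException {
+             Tuple t = TupleFactory.getInstance().newTuple(1);
+             t.set(0, new DataByteArray(bytes));
+             return t;
+         }
+     }
+ }}}
+ 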
+ Both of these approaches suffer from the problem that they assume !TextInputFormat and !TextOutputFormat.  For any other !InputFormat/!OutputFormat it will not be clear how to
+ parse key/value pairs out of the stream data.
+ 
+ Replacing this optimization represents a fair amount of work.  As the current optimization is not documented, it is not clear how many users rely on it.  Based on that, I vote that
+ we do not reimplement this optimization until such time as we see a need for it.
  
  === Remaining Tasks ===
   * !BinStorage needs to implement !LoadMetadata's getSchema() to replace the current determineSchema()
@@ -709, +771 @@

  Nov 23 2009, Dmitriy Ryaboy
   * updated StoreMetadata to match changes made to LoadMetadata
  
+ Nov 25 2009, Gates
+  * Updated section on streaming to suggest creating an interface for streaming (de)serializers rather than having only one hardwired option.  Also added some thoughts on possible replacements for the current !BinaryStorage/split by file optimization.
+ 
