The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------
  
  Pig will not ship the files but would expect the files to be available on the 
compute nodes.
  
- If the cache clause has a `#<name>`, then Hadoop's DistributedCache will a 
create a symlink in the task's cwd for the cached file. So, one can use this to 
distribute binaries too.
+ If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will 
create a symlink in the task's cwd for the cached file, so this can also be 
used to distribute binaries.
  
  {{{
  define X `./stream.pl` cache('/home/joe/foo#stream.pl')
@@ -306, +306 @@

  
  ==== 4.3 Ability to process binary data ====
  
- Sometimes, applications need to consume the entire data file without any 
parsing. All we would need in this case is to provide a custom loader function 
that just reads the entire data.
+ Sometimes, applications need to consume the entire data file without any 
parsing. In those cases, applications can specify the ''split by 'file' '' 
option for the LoadFunc being used. They can also use ''BinaryStorage'' to 
tell Pig not to parse the data at all, so that they get the raw data 
directly.
  
  {{{
- A = load 'data' using AsIsLoader();
+ A = load 'data' using BinaryStorage() split by 'file';
  B = stream A through `stream.pl`;
  }}}
  
@@ -345, +345 @@

  
  We should have a performance target in mind as compared to Hadoop streaming. 
I think for the initial release it would make sense to aim for '''30%''' 
overhead for streaming in Pig.
  
+ ==== 5.1 Load/Stream and Stream/Store optimizations ====
+ 
+ In cases where the STREAM operator immediately follows the LOAD or directly 
precedes the STORE operator, and the two have the '''same''' 
LoadFunc/StoreFunc specifications, Pig will try to optimize away the 
interpretation of data in the LoadFunc/StoreFunc (i.e. the need to break up 
raw input into ''Tuples'') by replacing the equivalent {Load|Store}Funcs with 
!BinaryStorage. For the LOAD/STREAM case, the caveat is that this is feasible 
only when each individual task processes all of the data in a given input 
file (i.e. the split by 'file' option is specified to the LOAD operator).
+ 
+ For example, Pig will optimize:
+ {{{
+ IP = load 'data' split by 'file';
+ OP = stream IP through `myscript`;
+ store OP into 'output';
+ }}}
+ into
+ {{{
+ define CMD `myscript` input(stdin using BinaryStorage()) output(stdout using BinaryStorage());
+ IP = load 'data' using BinaryStorage() split by 'file';
+ OP = stream IP through CMD;
+ store OP into 'output' using BinaryStorage();
+ }}}
+ 
+ However,
+ {{{
+ IP = load 'data' using PigStorage(',') split by 'file';
+ OP = stream IP through `myscript`;
+ store OP into 'output';
+ }}}
+ 
+ cannot be fully optimized: the LOAD/STREAM pair has different !LoadFuncs 
(load uses !PigStorage(',') while stream uses the default !PigStorage()), so 
it is left as is. The STREAM/STORE pair will still be optimized to use 
!BinaryStorage.
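+ 
+ As a sketch of the converse case (an assumption drawn from the rule above, 
not a confirmed example from the implementation), declaring the same 
!LoadFunc on the stream command's input as on the LOAD should make the 
LOAD/STREAM pair eligible as well. CMD is an illustrative alias:
+ {{{
+ -- assumption: matching PigStorage(',') on both sides of the LOAD/STREAM pair
+ define CMD `myscript` input(stdin using PigStorage(','));
+ IP = load 'data' using PigStorage(',') split by 'file';
+ OP = stream IP through CMD;
+ store OP into 'output';
+ }}}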
+ 
  [[Anchor(Referencs)]]
  == References ==
  
