Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

------------------------------------------------------------------------------
     * If the first word on the streaming command is `perl` or `python`, pig 
assumes that the binary is the first non-quoted string it encounters that 
does not start with a dash.
     * Otherwise, pig will attempt to ship the first string from the command 
line as long as it does not come from `/bin, /usr/bin, /usr/local/bin`. It 
determines that by scanning the path if an absolute path is provided, or by 
executing `which` otherwise. The paths to skip can be configured via the 
`set stream.skippath <path>` option. (Users can use multiple `set` commands to 
specify more than one path to skip.)
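
The two auto-shipping rules above can be sketched roughly as follows. This is a minimal illustration in Python, not Pig's actual implementation: the function name `auto_ship_candidate`, the regex-based tokenizer, the use of `shutil.which` as a stand-in for executing `which`, and the choices to compare only the direct parent directory and to ship nothing when the command cannot be resolved are all assumptions made for the example.

```python
import posixpath
import re
import shutil

# Default directories whose binaries are never shipped (from the text above).
SKIP_PATHS = ["/bin", "/usr/bin", "/usr/local/bin"]


def tokenize(command):
    # Split on whitespace but keep quoted strings as single tokens,
    # quotes included, so quoted arguments can be recognised later.
    return re.findall(r"'[^']*'|\"[^\"]*\"|\S+", command)


def auto_ship_candidate(command, skip_paths=SKIP_PATHS, which=shutil.which):
    """Return the path that would be auto-shipped, or None (hypothetical helper)."""
    tokens = tokenize(command)
    if not tokens:
        return None
    first = tokens[0]

    # Rule 1: for perl/python, take the first token that is neither
    # quoted nor an option starting with a dash.
    if first in ("perl", "python"):
        for tok in tokens[1:]:
            if tok[0] not in "'\"" and not tok.startswith("-"):
                return tok
        return None

    # Rule 2: resolve the first token, then refuse to ship it if it
    # lives in one of the skipped directories.
    resolved = first if posixpath.isabs(first) else which(first)
    if resolved is None:
        return None  # assumption: nothing is shipped if resolution fails
    if posixpath.dirname(resolved) in skip_paths:
        return None
    return resolved
```

With these assumptions, `perl -w 'x y' script.pl arg` yields `script.pl`, while a command whose binary resolves to `/bin` or `/usr/bin` yields nothing to ship.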
  
- To prevent a command from being shipped, an empty list can be passed to 
`ship` clause.
- 
  Note that we need to make sure that executables retain their permissions and 
can be executed on the compute nodes. 
  
  If the user does supply a DEFINE for a given streaming command, then the 
above 'auto-shipping' is turned off.
  
+ Note that users can provide arbitrary (absolute/relative) paths to the `ship` 
spec. Pig will make the supplied path relative to the task's cwd.
+ 
+ For example, if the ship spec is `ship('../../X/Y/script')`, then the script 
will be available at `../../X/Y/script` relative to the cwd of the task, and 
the command should be defined accordingly, i.e.
+ {{{
+ DEFINE CMD `../../X/Y/script` ship('../../X/Y/script');
+ }}}
+ 
+ If the user provides an absolute path to be shipped, then the binary will be 
available at the same path relative to the task's cwd, i.e.
+ {{{
+ DEFINE CMD `./X/Y/script` ship('/X/Y/script');
+ }}}
+ 
+ These features are meant to let users ship binaries that have the same name 
but live in different paths.
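
The mapping from a `ship` path to the location of the file on the task can be sketched as below. This is only an illustration of the rule described above, written in Python; the helper names `shipped_location` and `define_statement` are hypothetical and are not part of Pig.

```python
import posixpath


def shipped_location(ship_path):
    """Where a shipped file ends up, relative to the task's cwd."""
    if posixpath.isabs(ship_path):
        # An absolute path is re-rooted under the task's cwd:
        # /X/Y/script -> ./X/Y/script
        return "./" + ship_path.lstrip("/")
    # A relative path is used unchanged, relative to the task's cwd.
    return ship_path


def define_statement(alias, ship_path):
    """Build a DEFINE whose command matches the shipped location."""
    return "DEFINE %s `%s` ship('%s');" % (
        alias, shipped_location(ship_path), ship_path)
```

For instance, `define_statement("CMD", "/X/Y/script")` produces a statement whose command is `./X/Y/script`, so two scripts with the same file name but different directories stay distinct on the task.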
+ 
  ==== 2.2 Ability to cache data ====
  
  The approach described above works fine for binaries/jars and small data 
sets. For larger datasets, loading them at run time for every execution can 
have serious performance consequences. 
@@ -159, +173 @@

  }}}
  
  Pig will not ship the files; it expects them to already be available on the 
compute nodes.
+ 
+ If the cache clause has a `#<name>`, then Hadoop's DistributedCache will 
create a symlink with that name in the task's cwd for the cached file. So, one 
can use this to distribute binaries too.
+ 
+ {{{
+ define X `./stream.pl` cache('/home/joe/foo#stream.pl')
+ }}}
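
How a cache spec with a `#<name>` suffix relates to the command path can be sketched as follows. This is an illustrative Python fragment only; `parse_cache_spec` and `command_for` are hypothetical helpers and do not reflect how Pig or Hadoop parse the spec internally.

```python
def parse_cache_spec(spec):
    """Split 'path#name' into (path, symlink name or None)."""
    path, sep, name = spec.partition("#")
    return path, (name if sep and name else None)


def command_for(spec):
    """Command path to use in DEFINE when the spec creates a symlink."""
    _, name = parse_cache_spec(spec)
    # The symlink lives in the task's cwd, so it is invoked as ./<name>.
    return "./" + name if name else None
```

For the spec in the example above, `/home/joe/foo#stream.pl`, this gives the cached path `/home/joe/foo` and the command `./stream.pl`.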
  
  [[Anchor(Input/Output_Handling)]]
  === 3 Input/Output Handling ===
