pig-commits  

[Pig Wiki] Update of "PigStreamingFunctionalSpec" by Arun C Murthy

Apache Wiki
Fri, 04 Apr 2008 02:24:27 -0700

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

The comment on the change is:
Updated ship spec & clarified cache spec

------------------------------------------------------------------------------
  
  If the user does supply a DEFINE for a given streaming command, then the 
above 'auto-shipping' is turned off.
  
- Note that users can provide arbitrary (absolute/relative) paths to the `ship` 
spec. Pig will make the supplied path relative to the task's cwd.
+ Note that the `ship` spec has 2 facets: the source (provided in the ship() 
clause) is the user's view of his machine, while what is specified as the 
`command` is the view on the actual cluster. The user has to be aware of these 
two separate views. The _only_ guarantee offered to the user is that the 
shipped files are available in the current-working-directory of the launched 
job and that this cwd is also on the PATH environment variable. 
  
+ Thus, the important points to keep in mind are:
- E.g.
- If the ship spec is: ship('../../X/Y/script') then the script will be 
available at `../../X/Y/script` relative to the cwd of the task, and hence the 
command should be
- defined accordingly i.e.
- {{{
- DEFINE CMD `../../X/Y/script` ship(../../X/Y/script);
- }}}
  
- If the user provides an absolute path to be shipped, then the binary will be 
available at the same path relative to his cwd i.e.
+ 1. If Pig determines that it needs to auto-ship an absolute path e.g.
  {{{
+ OP = stream IP through `/a/b/c/script`;
+ }}}
+ or
+ {{{
+ OP = stream IP through `perl /a/b/c/script.pl`;
+ }}}
+ it will `not` ship it at all since there is no way to ship files to the 
necessary location (lack of permissions etc.).
+ 
+ 2. It is only safe to ship files that will be executed from the cwd of the 
task on the cluster:
+ {{{
+ OP = stream IP through `script`;
+ }}}
+ or 
+ {{{
- DEFINE CMD `./X/Y/script` ship(/X/Y/script);
+ DEFINE CMD `script` ship('/a/b/script');
+ OP = stream IP through `script`;
  }}}
  
- These features are meant to let users ship binaries with the same names, 
present in different paths.
+ Shipping files to relative paths or absolute paths is undefined and will 
mostly fail, since users might not have permissions to read/write/execute from 
arbitrary paths on the actual clusters.
+ 
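The supported and unsupported forms described above can be put side by side. This is only a sketch: the local path `/home/joe/bin/script`, the input name `input`, and the aliases `CMD` and `BAD` are hypothetical, and it assumes the defined alias is referenced by name in the `stream` statement.

{{{
-- Hypothetical input; any relation would do here.
IP = load 'input';

-- Supported: the ship() source is a path on the user's machine, while the
-- command names the file as it appears in the task's cwd on the cluster.
DEFINE CMD `script` ship('/home/joe/bin/script');
OP = stream IP through CMD;

-- Undefined: the command points outside the task's cwd, where the shipped
-- file is not guaranteed to exist or to be executable.
-- DEFINE BAD `/home/joe/bin/script` ship('/home/joe/bin/script');
}}}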
  
  ==== 2.2 Ability to cache data ====
  
@@ -174, +185 @@

  
  Pig will not ship the files but would expect the files to be available on the 
compute nodes.
  
- If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will a 
create a symlink in the task's cwd for the cached file. So, one can use this to 
distribute binaries too.
+ If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will 
create a symlink in the task's cwd for the cached file. So, one can use this to 
distribute binaries too, since the cwd is on the PATH. Note that the symlink 
feature `has to be` used by Pig-Streaming users, since they cannot predict the 
actual path of the cached file on the cluster nodes; the symlink is always in 
the cwd.
  
  {{{
  define X `./stream.pl` cache('/home/joe/foo#stream.pl')