Apache Wiki
Fri, 04 Apr 2008 02:24:27 -0700
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by Arun C Murthy:
http://wiki.apache.org/pig/PigStreamingFunctionalSpec

The comment on the change is:
Updated ship spec & clarified cache spec

------------------------------------------------------------------------------
  If the user does supply a DEFINE for a given streaming command, then the above 'auto-shipping' is turned off.

- Note that users can provide arbitrary (absolute/relative) paths to the `ship` spec. Pig will make the supplied path relative to the task's cwd.
+ Note that the `ship` spec has 2 facets: the source (provided in the ship() clause) is the user's view of his machine, while what is specified as the `command` is the view on the actual cluster. The user has to be aware of these two separate views. The _only_ guarantee offered to the user is that the shipped files are available in the current working directory (cwd) of the launched job and that this cwd is also on the PATH environment variable.
+ Thus, the important points to keep in mind are:

- E.g.
- If the ship spec is: ship('../../X/Y/script') then the script will be available at `../../X/Y/script` relative to the cwd of the task, and hence the command should be
- defined accordingly i.e.
- {{{
- DEFINE CMD `../../X/Y/script` ship(../../X/Y/script);
- }}}
- If the user provides an absolute path to be shipped, then the binary will be available at the same path relative to his cwd i.e.
+  1. If Pig determines that it needs to auto-ship an absolute path e.g. {{{
+ OP = stream IP through `/a/b/c/script`;
+ }}}
+ or
+ {{{
+ OP = stream IP through `perl /a/b/c/script.pl`;
+ }}}
+ it will `not` ship it at all, since there is no way to ship files to the necessary location (lack of permissions etc.).
+
+  2. It is safe only to ship files that are to be executed from the cwd of the task on the cluster:
+ {{{
+ OP = stream IP through `script`;
+ }}}
+ or
+ {{{
- DEFINE CMD `./X/Y/script` ship(/X/Y/script);
+ DEFINE CMD `script` ship('/a/b/script');
+ OP = stream IP through `script`;
  }}}
- These features are meant to let users ship binaries with the same names, present in different paths.
+ Shipping files to relative or absolute paths is undefined and will most likely fail, since users might not have permissions to read/write/execute from arbitrary paths on the actual clusters.
+
  ==== 2.2 Ability to cache data ====

@@ -174, +185 @@

  Pig will not ship the files but would expect the files to be available on the compute nodes.

- If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will create a symlink in the task's cwd for the cached file. So, one can use this to distribute binaries too.
+ If the cache clause has a `#<name>`, then Hadoop's !DistributedCache will create a symlink in the task's cwd for the cached file. So, one can use this to distribute binaries too, since the cwd is on the PATH. Note that the symlink feature `has to be` used by Pig-Streaming users, since they cannot predict the actual path of the cached file on the cluster nodes; the symlink is always in the cwd.

  {{{
  define X `./stream.pl` cache('/home/joe/foo#stream.pl')