Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by Arun C Murthy:
* If the first word of the streaming command is `perl` or `python`, Pig
assumes that the binary is the first non-quoted string it encounters that
does not start with a dash.
* Otherwise, Pig will attempt to ship the first string from the command
line as long as it does not come from `/bin`, `/usr/bin`, or `/usr/local/bin`.
It determines this by scanning the path if an absolute path is provided, or by
executing `which` otherwise. The paths to skip can be configured via the
`set stream.skippath <path>` option. (Users can issue multiple `set` commands
to specify more than one path to skip.)
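
As a rough illustration of the two rules above, the following sketch streams through a command that has no DEFINE, so auto-shipping would apply. The directory names, the relation names, and `myscript.pl` are hypothetical, and the quoting of the `set` value is an assumption; the page only gives the form `set stream.skippath <path>`.

```pig
-- Hypothetical extra directories whose binaries should not be shipped.
-- One `set` command per path, as described above (quoting is an assumption).
set stream.skippath '/opt/tools/bin';
set stream.skippath '/home/joe/bin';

A = LOAD 'input';

-- No DEFINE exists for this command, so the auto-shipping rules apply.
-- The first word is `perl`, so the binary is taken to be `myscript.pl`:
-- the first non-quoted string that does not start with a dash.
B = STREAM A THROUGH `perl -w myscript.pl`;
```

Under the first rule, `-w` would be passed over because it starts with a dash, leaving `myscript.pl` as the candidate to ship.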
- To prevent a command from being shipped, an empty list can be passed to
Note that we need to make sure that executables retain their permissions and
can be executed on the compute nodes.
If the user does supply a DEFINE for a given streaming command, then the
above 'auto-shipping' is turned off.
+ Note that users can provide arbitrary (absolute/relative) paths to the `ship`
spec. Pig will make the supplied path relative to the task's cwd.
+ If the ship spec is `ship('../../X/Y/script')`, then the script will be
available at `../../X/Y/script` relative to the cwd of the task, and hence the
command should be defined accordingly, i.e.
+ DEFINE CMD `../../X/Y/script` ship('../../X/Y/script');
+ If the user provides an absolute path to be shipped, then the binary will be
available at the same path relative to the task's cwd, i.e.
+ DEFINE CMD `./X/Y/script` ship('/X/Y/script');
+ These features are meant to let users ship binaries that have the same name
but are located in different paths.
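
A minimal sketch of that use case, assuming two unrelated scripts that are both named `script`. The second path (`/P/Q/script`), the aliases `CMD_A` and `CMD_B`, and the relation names are hypothetical and chosen only for illustration.

```pig
-- Two different binaries that share the file name `script`.
-- Each absolute path is made relative to the task's cwd, so the
-- two copies do not collide on the compute node.
DEFINE CMD_A `./X/Y/script` ship('/X/Y/script');
DEFINE CMD_B `./P/Q/script` ship('/P/Q/script');

A = LOAD 'input';
B = STREAM A THROUGH CMD_A;
C = STREAM A THROUGH CMD_B;
```

Because each command refers to its script by the shipped path rather than by bare name, both can be used within the same Pig script.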
==== 2.2 Ability to cache data ====
The approach described above works fine for binaries/jars and small data
sets. For larger datasets, loading them at run time for every execution can
have serious performance consequences.
@@ -159, +173 @@
Pig will not ship the files but would expect the files to be available on the
+ If the cache clause has a `#<name>`, then Hadoop's DistributedCache will
create a symlink in the task's cwd for the cached file. So, one can use this to
distribute binaries too.
+ define X `./stream.pl` cache('/home/joe/foo#stream.pl')
=== 3 Input/Output Handling ===