Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

New page:
= Parameter Substitution in Pig =

==  Motivation ==

This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a template pig script and then use it with different parameters on a 
regular basis. For instance, if you have daily processing that is identical 
every day except for the data it needs to process, it would be very convenient to 
put a placeholder for the date and provide the actual value at run time.

== Requirements ==

1. Ability to have parameters within a pig script and provide values for these 
parameters at run time.
2. Ability to provide parameter values on the command line
3. Ability to provide parameter values in a file
4. Ability to generate a parameter value at run time by running a binary or 
script.
5. Ability to provide default values for parameters
6. Ability to retain the script with all parameters resolved. This is mostly 
for debugging purposes.

== Interface ==

=== Parameter Specification ===

Parameters in the script will be enclosed in `%`. The content within `%` can be 
either a parameter name, in which case it starts with `$`, or a command to run 
to generate the value, in which case it will be enclosed in backticks.

'''Example 1''': parameter name specification

{{{
A = load '/data/mydata/%$date%';
B = filter A by $0>'5';
.....
}}}

In this case, pig would expect the caller to supply a value for the parameter 
named `date` and would place it in the path before executing the load statement.

'''Example 2''': command specification

{{{
A = load '/data/mydata/%`generate_date`%';
B = filter A by $0>'5';
.....
}}}

In this case, pig would execute the program called `generate_date` and place 
its stdout into the path before executing the load statement.

'''Example 3''': command taking a parameter

{{{
A = load '/data/mydata/%`generate_name $date`%';
B = filter A by $0>'5';
.....
}}}

In this case, parameter `date` is substituted first, then the `generate_name` 
command is run with the value of `date` as a command line argument, and the 
stdout of the command is placed into the path prior to executing the load 
statement.

'''Example 4''': command passed as a parameter

{{{
A = load '/data/mydata/%`$cmd $date`%';
B = filter A by $0>'5';
.....
}}}

In this case, parameters `cmd` and `date` are substituted first. Then the 
resulting command is executed and its stdout is placed into the path prior to 
running the load statement.
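
The four examples above can be sketched in Python as follows. This is only an 
illustration of the proposed semantics, not the actual implementation: the 
function names, the identifier syntax accepted after `$`, running the command 
through a shell, and stripping the trailing newline from stdout are all 
assumptions that the proposal does not spell out.

```python
import re
import subprocess

SEGMENT = re.compile(r"%(.*?)%")          # text enclosed in a pair of %
NAME = re.compile(r"\$([A-Za-z_]\w*)")    # $identifier (identifier syntax is assumed)


def run_command(command):
    """Run a command through the shell; return stdout without trailing newlines."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"command failed: {command}")
    return result.stdout.rstrip("\n")


def expand(line, params, run=run_command):
    """Replace every %...% segment in line with its resolved value."""

    def resolve_name(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined parameter: {name}")
        return params[name]

    def resolve_segment(match):
        # Parameters are substituted first, then any command is executed.
        body = NAME.sub(resolve_name, match.group(1))
        if len(body) >= 2 and body.startswith("`") and body.endswith("`"):
            return run(body[1:-1])
        return body

    return SEGMENT.sub(resolve_segment, line)
```

Text outside a `%` pair is left alone, so positional references such as `$0` 
in the filter statement are not treated as parameters.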

=== Parameter Passing ===

Parameters can be specified on the pig command line using the 
`-param <param>=<val>` construct. Multiple parameters can be specified. If the 
same parameter is specified multiple times, the last value will be used and a 
warning will be generated.

The command line for Example 4 above would look as follows:

{{{
pig -param date='20080201' -param cmd='generate_name'
}}}
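
A minimal Python sketch of the `-param` handling described above, assuming the 
arguments arrive as a list after the shell has already removed the quotes. The 
function name `parse_param_args` is hypothetical, and raising an error for a 
malformed argument is an assumption.

```python
import warnings


def parse_param_args(argv):
    """Collect `-param name=value` pairs; the last duplicate wins with a warning."""
    params = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-param":
            if i + 1 >= len(argv) or "=" not in argv[i + 1]:
                raise ValueError("-param expects <param>=<val>")
            # Split on the first '=' only, so the value may itself contain '='.
            name, value = argv[i + 1].split("=", 1)
            if name in params:
                warnings.warn(f"parameter {name} specified more than once; "
                              "using the last value")
            params[name] = value
            i += 2
        else:
            i += 1  # not a -param option; ignored by this sketch
    return params
```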

Parameters can also be specified in a file that is passed to pig using the 
`-param_file <file>` construct. Multiple files can be specified. If the same 
parameter is present multiple times in a file, the last value will be used. 
If a parameter is present in multiple files, the value from the last file will 
be used and a warning will be generated.

A parameter file will contain one line per parameter. Empty lines are allowed. 
Perl style (#) comment lines are also allowed. A comment must take the full 
line and `#` must be the first character on the line. Each parameter line will 
be of the form `<param_name>=<param_value>`. Whitespace around `=` is allowed 
but optional. A parameter value can include whitespace. There is no need to 
quote the value; any quotes will be considered part of the value.

The parameter file for Example 4 above would look as follows:

{{{
# my parameters

date = 20080201
cmd = generate_name
}}}
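
The file format above could be parsed along these lines. This is a sketch under 
stated assumptions: the name `parse_param_file` is hypothetical, whitespace at 
the end of a value is stripped, and a non-comment line without `=` is treated 
as an error, none of which the proposal specifies.

```python
def parse_param_file(text):
    """Parse parameter file contents into a dict; the last duplicate wins."""
    params = {}
    for line in text.splitlines():
        # Skip empty lines and comment lines ('#' must be the first character).
        if not line.strip() or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError(f"expected <param_name>=<param_value>: {line!r}")
        name, value = line.split("=", 1)
        # Whitespace around '=' is optional; quotes stay part of the value.
        params[name.strip()] = value.strip()
    return params
```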

Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.

Default parameter values can be specified in a script using the `declare 
<param>=<value>` command:

{{{
declare cmd=generate_name
}}}

A default value is only used if the parameter is not specified on the command 
line or in a parameter file.
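
Taken together, the precedence rules in this section amount to: defaults from 
`declare`, then parameter files in the order given, then the command line. A 
small Python sketch, assuming each source has already been parsed into a dict 
(the name `resolve_params` and its arguments are hypothetical):

```python
import warnings


def resolve_params(defaults, file_params, cmdline_params):
    """Merge parameter sources; later sources override earlier ones.

    defaults       -- dict built from declare statements
    file_params    -- list of dicts, one per parameter file, in command line order
    cmdline_params -- dict built from -param options
    """
    from_files = {}
    for params in file_params:
        for name, value in params.items():
            if name in from_files:
                warnings.warn(f"parameter {name} present in multiple files; "
                              "using the value from the last file")
            from_files[name] = value
    return {**defaults, **from_files, **cmdline_params}
```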

=== Debugging ===

If the -debug option is specified to pig, it will produce a fully substituted 
pig script in the current working directory named 
`<original name>.substituted`.

A -dryrun option will be added to pig, in which case no execution is performed 
and only the substituted script is produced. We can also use the same option to 
produce just the execution plan.
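
As a small illustration of the naming rule, the output path could be derived 
as below. Whether `<original name>` keeps the script's extension is not stated 
in the proposal; this sketch assumes it does, and the function name 
`substituted_path` is hypothetical.

```python
from pathlib import Path


def substituted_path(script_path, cwd=None):
    """Return <cwd>/<original name>.substituted for the given script."""
    base = Path(cwd) if cwd is not None else Path.cwd()
    # Only the file name is used; the output always goes to the working directory.
    return base / (Path(script_path).name + ".substituted")
```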

== Design ==

A C-style preprocessor will be written to perform parameter substitution. The 
preprocessor will do the following:

 1. Create an empty `<original name>.substituted` file in the current working 
directory
 2. Read parameters from files, the command line, and declare statements and 
construct a hash, preserving the precedence rules for duplicates described 
above
 3. For each line in the input script
  * if declare line, skip
  * if comment or empty line, copy over
  * for all other lines
   * parse each part enclosed in `%` and remove the enclosing `%` characters. 
Locate any identifier that starts with `$` and look it up in the hash. If found, 
make the substitution; otherwise, report an error and abort.
   * if the part is enclosed in backticks, run the substituted command. If the 
command succeeds, substitute the command with its stdout. If it fails, report 
an error and abort.
   * place the substituted line into the output file.
 4. If -dryrun is not specified, pass the output file to grunt to execute. 
Otherwise, print the name of the file and exit.
 5. If neither -debug nor -dryrun is specified, remove the output file.
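
The line loop in step 3 and the decisions in steps 4 and 5 can be sketched in 
Python as follows. This is not the proposed implementation, only an outline: 
`expand_line` is a hypothetical callable standing in for the `%` segment 
substitution, the `declare` pattern is a guess at the syntax, and treating 
lines that start with `--` as comments is an assumption about pig scripts.

```python
import re

# Assumed shape of a declare line: declare <param>=<value>
DECLARE = re.compile(r"^\s*declare\s+(\w+)\s*=\s*(.*?)\s*$")


def preprocess(lines, supplied, expand_line):
    """Return the substituted script lines.

    supplied    -- parameters from files and the command line, already merged
    expand_line -- hypothetical callable (line, params) -> substituted line
    """
    # Step 2: declare statements provide defaults, which supplied values override.
    defaults = {}
    for line in lines:
        match = DECLARE.match(line)
        if match:
            defaults[match.group(1)] = match.group(2)
    params = {**defaults, **supplied}

    # Step 3: walk the script line by line.
    output = []
    for line in lines:
        stripped = line.strip()
        if DECLARE.match(line):
            continue                      # declare lines are skipped
        if not stripped or stripped.startswith("--"):
            output.append(line)           # comments and empty lines are copied
        else:
            output.append(expand_line(line, params))
    return output


def output_file_actions(debug, dryrun):
    """Steps 4 and 5: return (execute_script, keep_output_file)."""
    return (not dryrun, debug or dryrun)
```

Passing the substitution in as a callable keeps the line handling separate 
from the rules for `$` names and backticks, which makes each part easier to 
test on its own.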

== Code Integration ==

TBW
