Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

------------------------------------------------------------------------------
  
  == Motivation ==
  
- This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a template pig script and then use it with different parameters on a 
regular basis. For instance, if you have daily processing that is identical 
every day except the date it needs to process, it would be very convenient to 
put a placeholder for the date and provide the actual value at run time.
+ This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a  
+ template pig script and then use it with different parameters on a regular 
basis. For instance, if you have daily processing that is identical every day 
except the date it  
+ needs to process, it would be very convenient to put a placeholder for the 
date and provide the actual value at run time.
  
  == Requirements ==
  
@@ -17, +19 @@

  
  == Interface ==
  
- === Parameter Specification ===
+ === Using Parameters ===
  
- Parameters in a pig script will be of the form `$<identifier>`. 
+ Parameters in a pig script are in the form of `$<identifier>`. 
  
  {{{
  A = load '/data/mydata/$date';
@@ -27, +29 @@

  .....
  }}}
  
- For this example, pig would expect `date` to be passed from pig command line 
or from a parameter file. The value would be substituted prior to running the 
load statement.
+ In this example, the value of the `date` parameter is expected to be passed 
on each invocation of the script and is substituted in before running the pig 
script. An error  
+ is generated if the value for any parameter is not found.
  
- In addition to supplying parameter value, a user can supply a command to 
execute to generate a parameter value. This can be done using `declare` 
statement. 
+ A parameter name has the structure of a standard language identifier: it must 
start with a letter or underscore followed by any number of letters, digits, 
and underscores. The  
+ names are case insensitive. The names can be escaped with `\` in which case 
substitution does not take place.
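
The naming and escaping rules above can be sketched as follows. This is a minimal illustration in Python, not the proposed implementation: the names `PARAM_RE` and `substitute` are hypothetical, and two details are assumptions the proposal does not spell out, namely that case insensitivity is applied by lower-casing names and that the escaping backslash is dropped from the output.

```python
import re

# Hypothetical pattern: an optional backslash escape, then `$` and an
# identifier that starts with a letter or underscore.
PARAM_RE = re.compile(r"(\\?)\$([A-Za-z_][A-Za-z0-9_]*)")


def substitute(line, params):
    """Replace $name with its value; an escaped \\$name is not substituted."""
    # Assumption: names are compared case insensitively by lower-casing them.
    lowered = {name.lower(): value for name, value in params.items()}

    def replace(match):
        escaped, name = match.group(1), match.group(2)
        if escaped:
            # Assumption: the backslash is removed and the text kept literally.
            return "$" + name
        key = name.lower()
        if key not in lowered:
            raise KeyError("undefined parameter: " + name)
        return lowered[key]

    return PARAM_RE.sub(replace, line)
```

Because an identifier cannot start with a digit, positional references such as `$0` in a filter expression do not match the pattern and pass through unchanged.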
+ 
+ In the initial version of the software, parameters are only allowed when a 
pig script is specified. They are disabled with the `-e` switch and in 
interactive mode. 
+ 
+ === Specifying Parameters ===
+ 
+ Parameter values can be supplied in four different ways.
+ 
+ ==== Command Line ====
+ 
+ Parameters can be passed via pig command line using `-param <param>=<val>` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple  
+ times, the last value will be used and a warning will be generated.
+ 
+ The command line for Example 4 above would look as follows:
+ 
+ {{{
+ pig -param date='20080201'
+ }}}
+ 
+ ==== Parameter File ====
+ 
+ Parameters can also be specified in a file that can be passed to pig using 
`-param_file <file>` construct. Multiple files can be specified. If the same 
parameter is present  
+ multiple times in the file, the last value will be used and a warning will be 
generated. If a parameter is present in multiple files, the value from the last 
file will be used  
+ and a warning will be generated.
+ 
+ A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be  
+ the first character on the line. Each parameter line will be of the form: 
`<param_name>=<param_value>`. White spaces around `=` are allowed but are 
optional. 
+ 
+ {{{
+ # my parameters
+ 
+ date = '20080201'
+ cmd = `generate_name`
+ }}}
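
A parameter file in this format could be read roughly as shown here. This is a sketch under stated assumptions rather than the actual Pig code: `parse_param_file` is a hypothetical name, values are returned in their raw form (quotes or backticks still attached), and a line without `=` is treated as an error, which the proposal does not specify.

```python
import warnings


def parse_param_file(text):
    """Parse parameter-file text into a dict; the last duplicate wins."""
    params = {}
    for raw in text.splitlines():
        # Empty lines and full-line comments (`#` in the first column) are skipped.
        if not raw.strip() or raw.startswith("#"):
            continue
        if "=" not in raw:
            # Assumption: a malformed line is an error.
            raise ValueError("expected <param_name>=<param_value>: " + raw)
        name, value = raw.split("=", 1)
        # White space around `=` is allowed but optional.
        name, value = name.strip(), value.strip()
        if name in params:
            warnings.warn("duplicate parameter: " + name)
        params[name] = value
    return params
```

Splitting on the first `=` only keeps any later `=` characters inside the value intact.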
+ 
+ Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
+ 
+ ==== Declare Statement ====
+ 
+ The `declare` command can be used from within a pig script. The use case for 
this is to describe one parameter in terms of other(s).
+ 
+ {{{
+ %declare CMD `$mycmd $date`
+ A = load '/data/mydata/$CMD';
+ B = filter A by $0>'5';
+ .....
+ }}}
+ 
+ The format is `%declare <param> <value>`
+ 
+ `declare` command starts with `%` to indicate that this is a preprocessor 
command that is processed prior to executing pig script. It takes the highest 
precedence. The  
+ scope of parameter value defined via `declare` is all the lines following 
`declare` command until the next `declare` command that defines this parameter 
is encountered.
+ 
+ ==== Default Statement ====
+ 
+ `default` command can be used to provide a default value for a parameter. 
This value is used if the parameter has no value defined by any other means. 
(`default` has the  
+ lowest priority.)
+ 
+ `default` has format and scoping rules identical to `declare`.
+ 
+ {{{
+ %default DATE '20080101'
+ }}}
+ 
+ ==== Processing Order ====
+ 
+  1. Configuration files will be scanned in the order they are specified on 
the command line. Within each file, the parameters are processed in the order 
they are  
+ specified.
+  2. Command line parameters will be scanned in the order they are specified 
on the command line.
+  3. declare/default commands will be processed in the order they appear in 
the pig script.
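
The precedence that results from this processing order can be illustrated with plain dictionaries. This is a hedged sketch: the function names `merge_sources`, `apply_default`, and `apply_declare` are hypothetical and the warnings for duplicates are left out.

```python
def merge_sources(file_params, cmdline_params):
    """Later sources override earlier ones: files in order, then the command line."""
    merged = {}
    for source in list(file_params) + [cmdline_params]:
        merged.update(source)
    return merged


def apply_default(params, name, value):
    """%default: used only when the parameter has no value yet (lowest precedence)."""
    params.setdefault(name, value)


def apply_declare(params, name, value):
    """%declare: overrides whatever was defined before (highest precedence)."""
    params[name] = value
```

For example, a `date` defined in two files and a `cmd` defined in a file and on the command line resolve to the last file's `date` and the command line's `cmd`; a later `%declare date ...` in the script then replaces the file value.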
+ 
+ ==== Value Format ====
+ 
+ Value formats are identical regardless of how the parameter is specified and 
can be of two types. First is a sequence of characters enclosed in single or 
double quotes. In  
+ this case the unquoted version of the value is used during substitution. 
Quotes within the value can be escaped.
+ 
+ {{{
+ %declare DESC 'Joe\'s URL'
+ A = load 'data' as (name, desc, url);
+ B = FILTER A by desc eq '$DESC';
+ }}} 
+ 
+ Note that the constant given to the filter needs to be enclosed in quotes 
because the parameter value is the unquoted version of the string.
+ 
+ Second is a command enclosed in backticks. In this case, the command is 
executed and its `stdout` is used as the parameter value:
  
  {{{
  %declare CMD `generate_date`
@@ -38, +126 @@

  .....
  }}}
  
- For this example, pig would execute `generate_date` command when it 
encounters the `declare` statement and assigns the result (stdout) to parameter 
`CMD`. The value of `CMD` is substituted prior to running the load statement.
+ The values of both types can be expressed in terms of other parameters as 
long as the values of the dependent parameters are defined prior to this value.
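
The two value types could be resolved along these lines once any `$name` references inside the value have been substituted. This is an illustrative sketch, not the proposed code: `resolve_value` is a hypothetical name, stripping white space from the command output is an assumption, and so is returning any other value unchanged.

```python
import subprocess


def resolve_value(raw):
    """Turn a raw parameter value into the string used for substitution."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == "`" and raw[-1] == "`":
        # Command form: run it and use its stdout as the value.
        command = raw[1:-1]
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError("command failed: " + command)
        # Assumption: surrounding white space (the trailing newline) is dropped.
        return result.stdout.strip()
    if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
        quote = raw[0]
        # Quoted form: drop the outer quotes and unescape the inner ones.
        return raw[1:-1].replace("\\" + quote, quote)
    # Assumption: the proposal names only two value types, so anything else
    # is returned unchanged here.
    return raw
```

A value such as `'Joe\'s URL'` resolves to the unquoted string, which is why the script has to supply its own quotes around `$DESC` in the filter.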
  
- `declare` statement starts with `%` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. The declare statement runs till the end of the line unless the value is 
a literal in which case it can take multiple lines.
- 
- `declare` can also be used to define one parameter in terms of others:
- 
- {{{
- %declare param1 ($param2 + $param3)
- }}}
- 
- With exception to string literals that can span multiple lines, for initial 
release, `declare` is a single-line command.
- 
- The command specified within `declare` statement can take parameters which 
need to be substituted as well.
- 
- {{{
- %declare CMD `generate_date $date`
- A = load '/data/mydata/$CMD';
- B = filter A by $0>'5';
- .....
- }}}
- 
- For this example, parameter `date` is substituted first when `declare` 
statement is encountered. Then `generate_name` command is executed passing 
value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` 
which is used in the load statement prior to its execution.
- 
- Note that variables passed on the command line must be resolved prior to the 
declare statement. The following sequence would cause an error:
- 
- {{{
- %declare A `cmd1 $B`
- %declare $B `cmd2`
- }}}
- 
- Command name itself can be a parameter.
  
  {{{
  %declare CMD `$mycmd $date`
@@ -77, +136 @@

  .....
  }}}
  
- In this example, parameters `mycmd` and `date` are substituted first when 
`declare` statement is encountered. Then the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `mycmd` and `date` are substituted first when 
`declare` statement is encountered. Then the resulting command is executed and 
its stdout is  
+ placed into the path prior to running the load statement.
  
- Note that parameter names are case insensitive and $cmd and $CMD means the 
same thing. This is to match the rest of Pig Latin.
- 
- === Parameter Passing ===
- 
- Parameters can be specified on pig command line using `-param <param>=<val>` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple times, the last value will be used and a warning will be 
generated.
- 
- The command line for Example 4 above would look as follows:
- 
- {{{
- pig -param date='20080201' -param cmd='generate_name'
- }}}
- 
- Parameters can also be specified in a file that can be passed to pig using 
`-param_file <file>` construct. Multiple files can be specified. If the same 
parameter is present multiple times in the file, the last value will be used 
and a warning will be generated. If a parameter present in multiple files, the 
value from the last file will be used and a warning will be generated.
- 
- A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be the first character on the line. Each parameter line 
will be of the form: `<param_name>=<param_value>`. White spaces around `=` are 
allowed but are optional. A parameter value can include white spaces. There is 
no need to quote the value and the quotes will be considered part of the value.
- 
- The parameter file for Example 4 above would look as follows:
- 
- {{{
- # my parameters
- 
- date = 20080201
- cmd = generate_name
- }}}
- 
- Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
- 
- `declare` command takes the highest precedence. The scope of parameter value 
defined via `declare` is all the lines following `declare` command until the 
next `declare` command that defines this parameter is encountered.
- 
- Default parameter values can be specified in a script using `%default <param> 
<value>` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before. Only first `default` statement for a particular parameter is 
meaningful. The rest are warned on and are ignored.
- 
- {{{
- %default cmd=generate_name
- }}}
- 
- Values specified from the command line as well as configuration file can be 
commands or expressions including other parameters. Their format is identical 
to `declare` and `default` format. Also, the same rule that variables need to 
be resolved before they can be used applies. The following order will be used:
- 
-  1. Configuration files will be scanned in the order they are specified on 
the command line. Within each file, the parameteres are processed in the order 
they are specified.
-  2. Command line parameters will be scanned in the order they are specified 
on the command line.
-  3. declare/default commands will be processed in the order they appear in 
the pig script.
  
  === Debugging ===
  
  If -debug option is specified to pig, it will produce fully substituted pig 
script in the current working directory named `<original name>.substituted`
  
- A -dryrun option will be added to pig in which case no execution is performed 
and substituted script is produced. We can also use the same option to produce 
just the execution plan.
+ A -dryrun option will be added to pig in which case no execution is performed 
and substituted script is produced. We can also use the same option to produce 
just the  
+ execution plan.
  
  === Logging === 
  
- Pig uses apache commons(http://commons.apache.org/logging/) in conjunction 
with log4j(http://logging.apache.org/log4j/) and we should to the same in the 
parameter substitution code.
+ Pig uses apache commons(http://commons.apache.org/logging/) in conjunction 
with log4j(http://logging.apache.org/log4j/) and we should do the same in the 
parameter  
+ substitution code.
  
  The following code can be used to instantiate a logger:
  
@@ -146, +168 @@

  
  Note that this code will work once we integrate this into Pig.
  
- Pig uses INFO as the default log level. Any messages that you want users to 
see during normal operation should be logged at this level. Anything that is 
only useful for debugging, should be logged at DEBUG level. Warnings should be 
logged at WARN level.
+ Pig uses INFO as the default log level. Any messages that you want users to 
see during normal operation should be logged at this level. Anything that is 
only useful for  
+ debugging should be logged at DEBUG level. Warnings should be logged at WARN 
level.
  
  === Error Handling ===
  
@@ -156, +179 @@

  
   * ParseExceptions - for any errors due to parsing command line or config 
file parameters or pig script.
   * If the underlying code throws an exception and the exception is derived 
from RuntimeException - just let it propagate
-  * If the underlying code throws an exception that is not derived from 
RuntimeException, catch it and throw a RuntimeException with the original 
exception as the cause. (We want to make sure that we don't have to declare 
additional exceptions in our APIs.)
+  * If the underlying code throws an exception that is not derived from 
RuntimeException, catch it and throw a RuntimeException with the original 
exception as the cause. (We  
+ want to make sure that we don't have to declare additional exceptions in our 
APIs.)
   * Any exception that the code originates should be either RuntimeException 
or its derivation if appropriate.
  
  == Design ==
@@ -167, +191 @@

   2. Create `parameter hash` that maps parameter names to parameter values.
   3. Read parameters from files in the order they are specified on the command 
line
   4. `Resolve each parameter`:
-   * search the parameter value for variables that need to be replaced and 
perform replacement if needed. Generate an error and abort if replacement is 
needed but the correspondent parameter is not found in the parameter hash.
+   * search the parameter value for variables that need to be replaced and 
perform replacement if needed. Generate an error and abort if replacement is 
needed but the  
-   * if the parameter value is enclosed in backticks, run the command and 
capture its stdout. If the command succeeds (returns 0), store the parameter in 
the hash with the value equal to stdout of the command. If the command fails 
(returns non-0 value), report the error and abort the processing.
+ corresponding parameter is not found in the parameter hash.
+   * if the parameter value is enclosed in backticks, run the command and 
capture its stdout. If the command succeeds (returns 0), store the parameter in 
the hash with the  
+ value equal to stdout of the command. If the command fails (returns non-0 
value), report the error and abort the processing.
    * if the value is not a command, store it in the parameter hash.
    * if this is a duplicate parameter, warn and replace the old value with 
newly generated one.
   5. Resolve each command line parameter in the order they are specified on 
the command line
    * use the same resolution steps as for parameters passed in a file
   6. For each line in the input script
    * if comment or empty line, copy over
-   * if declare line resolve the paramter using the same steps as for 
parameters passed in a file
+   * if declare line resolve the parameter using the same steps as for 
parameters passed in a file
-   * if default line is encountered, the parameter defined is looked up in the 
parameter hash. If the parameter is not found, processing identical to declare 
line is performed; otherwise, the line is skipped.
+   * if default line is encountered, the parameter defined is looked up in the 
parameter hash. If the parameter is not found, processing identical to declare 
line is  
+ performed; otherwise, the line is skipped.
    * for all other lines
-    * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash. (Reuse the code 
from the parameter substitution in declare statement.)
+    * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the corresponding  
+ parameter is not found in the parameter hash. (Reuse the code from the 
parameter substitution in declare statement.)
     * place the substituted line into the output file.
   7. If -dryrun is not specified, pass the output file to grunt to execute. 
Otherwise, print the name of the file and exit.
   8. If neither -debug nor -dryrun is specified, remove the output file.
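
The per-line part of this algorithm can be sketched as a small preprocessor. This is a simplified illustration under several assumptions, not the design itself: `preprocess` and `substitute` are hypothetical names, only quoted values are handled (no backtick commands or escapes), `--` is taken as the comment prefix, and `%declare`/`%default` lines are assumed not to be copied into the output.

```python
import re

PARAM_RE = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")
DIRECTIVE_RE = re.compile(r"^%(declare|default)\s+(\w+)\s+(.*)$")


def substitute(text, params):
    """Replace every $name in text; an undefined name aborts with an error."""
    def replace(match):
        name = match.group(1).lower()
        if name not in params:
            raise KeyError("undefined parameter: " + match.group(1))
        return params[name]
    return PARAM_RE.sub(replace, text)


def preprocess(script_lines, params):
    """Return the substituted script; params holds file and command line values."""
    params = {name.lower(): value for name, value in params.items()}
    output = []
    for line in script_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            output.append(line)  # empty line or comment: copy over
            continue
        directive = DIRECTIVE_RE.match(stripped)
        if directive:
            kind, name, value = directive.groups()
            name = name.lower()
            # declare always defines; default only if the name is still unknown.
            if kind == "declare" or name not in params:
                # Simplification: just strip the outer quote characters.
                params[name] = substitute(value, params).strip("'\"")
            continue  # assumption: preprocessor lines are not emitted
        output.append(substitute(line, params))
    return output
```

Keeping a single parameter hash that the loop updates as it goes gives `declare` its scoping for free: each definition applies to the lines that follow it until the next definition of the same name.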
  
- == Code Integration ==
+ == Future Features ==
  
- TBW
+ One nice feature to add later is to be able to constrain parameter names. For 
instance in the statement below the intent might be to replace only `$date` and 
leave `latest`  
+ in the path.
  
+ {{{
+ A = load 'data/$date_latest';
+ ...
+ }}}
+ 
+ This can be specified with perl-style syntax:
+ 
+ {{{
+ A = load 'data/${date}_latest';
+ ...
+ }}}
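
One way such a constrained name might be recognized is with a pattern that accepts both forms. This is only a sketch of the idea: `BRACED_RE` and `substitute` are hypothetical names and nothing here is part of the proposal.

```python
import re

# Hypothetical pattern accepting both ${name} and $name.
BRACED_RE = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def substitute(text, params):
    """Replace $name and ${name}; braces end the name explicitly."""
    def replace(match):
        name = match.group(1) or match.group(2)
        return params[name]
    return BRACED_RE.sub(replace, text)
```

With this pattern, `${date}_latest` looks up `date` and keeps `_latest` in the path, while `$date_latest` still looks up a parameter named `date_latest`.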
+ 
