[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-04-09 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  Parameters can be passed via pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple times, the last value will be used and a warning will be 
generated.
  
  {{{
- pig -param date=\'20080201\'
+ pig -param date=20080201
  }}}
  
   Parameter File 
@@ -56, +56 @@

  {{{
  # my parameters
  
- date = '20080201'
+ date = 20080201
  cmd = `generate_name`
  }}}
  
@@ -95, +95 @@

  
   Value Format 
  
- Value formats are identical regardless of how the parameter is specified and 
can be of two types. First is a sequence of characters enclosed in single or 
double quotes. In this case the unquoted version of the value is used during 
substitution. Quotes within the value can be escaped.
+ Value formats are identical regardless of how the parameter is specified and 
can be of two types. First is a sequence of characters enclosed in single or 
double quotes. In this case the unquoted version of the value is used during 
substitution. Quotes within the value can be escaped. Single-word values that 
don't use special characters such as `%` or `=` don't have to be quoted.
  
  {{{
  %declare DESC 'Joe\'s URL'


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-04-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  Parameters can be passed via pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple times, the last value will be used and a warning will be 
generated.
  
  {{{
- pig -param date='20080201'
+ pig -param date=\'20080201\'
  }}}
  
   Parameter File 


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-04-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  .
  }}}
  
+ In this example, the value of the `date` is expected to be passed on each 
invocation of the script and is substituted before running the pig script. An 
error is generated if the value for any parameter is not found.
- In this example, the value of the `date` parameter is expected to be passed 
on each invocation of the script and is substituted in before running the pig 
script. An error  
- is generated if the value for any parameter is not found.
  
- A parameter name have a structure of a standard language identifier: it must 
start with a latter or underscore followed by any number of letters, digits, 
and underscores. The  
+ A parameter name have a structure of a standard language identifier: it must 
start with a letter or underscore followed by any number of letters, digits, 
and underscores. The names are case insensitive. The names can be escaped with 
`\` in which case substitution does not take place.
- names are case insensitive. The names can be escaped with `\` in which case 
substitution does not take place.
  
  In the initial version of the software the parameters are only allowed when 
pig script is specified. They are disabled with `-e` switch or in the 
interactive mode. 
  
@@ -43, +41 @@

  
   Command Line 
  
- Parameters can be passed via pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple  
+ Parameters can be passed via pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple times, the last value will be used and a warning will be 
generated.
- times, the last value will be used and a warning will be generated.
- 
- The command line for Example 4 above would look as follows:
  
  {{{
  pig -param date='20080201'
@@ -54, +49 @@

  
   Parameter File 
  
+ Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present multiple times in the file, the last value will be used 
and a warning will be generated. If a parameter present in multiple files, the 
value from the last file will be used and a warning will be generated.
- Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present  
- multiple times in the file, the last value will be used and a warning will be 
generated. If a parameter present in multiple files, the value from the last 
file will be used  
- and a warning will be generated.
  
+ A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be the first character on the line. Each parameter line 
will be of the form: `=`. White spaces around `=` are 
allowed but are optional. 
- A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be  
- the first character on the line. Each parameter line will be of the form: 
`=`. White spaces around `=` are allowed but are 
optional. 
  
  {{{
  # my parameters
@@ -68, +60 @@

  cmd = `generate_name`
  }}}
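
The parameter file format described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not Pig's actual parser: the class name `ParamFileSketch`, its fields, and the choice of `IllegalArgumentException` for malformed lines are all hypothetical. Warnings are collected in a list rather than logged so the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a parameter file reader; not part of Pig.
final class ParamFileSketch {
    // Parameter names are stored lower-cased because names are case insensitive.
    final Map<String, String> values = new LinkedHashMap<>();
    final List<String> warnings = new ArrayList<>();

    void parse(List<String> lines) {
        for (String line : lines) {
            // Empty lines are allowed; '#' comments must start in the first column.
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                // Assumption: a line without '=' is treated as a parse error.
                throw new IllegalArgumentException("Expected name=value: " + line);
            }
            // White space around '=' is optional, so trim both sides.
            String name = line.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(eq + 1).trim();
            // The last value wins; a duplicate only produces a warning.
            if (values.put(name, value) != null) {
                warnings.add("Duplicate parameter: " + name);
            }
        }
    }

    public static void main(String[] args) {
        ParamFileSketch sketch = new ParamFileSketch();
        sketch.parse(Arrays.asList("# my parameters", "", "date = 20080201", "cmd = `generate_name`"));
        System.out.println(sketch.values);
    }
}
```

Values are kept verbatim here, including backticks; resolving commands and nested parameters would be a separate step.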
  
- Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
+ Files and command line parameters can be combined with command line 
parameters taking precedence.
  
   Declare Statement 
  
@@ -83, +75 @@

  
  The format is `%declare  `
  
+ `declare` command starts with `%` to indicate that this is a preprocessor 
command that is processed prior to executing pig script. It takes the highest 
precedence. The scope of parameter value defined via `declare` is all the lines 
following `declare` command until the next `declare` command that defines the 
same parameter is encountered.
- `declare` command starts with `%` to indicate that this is a preprocessor 
command that is processed prior to executing pig script. It takes the highest 
precedence. The  
- scope of parameter value defined via `declare` is all the lines following 
`declare` command until the next `declare` command that defines this parameter 
is encountered.
  
   Default Statement 
  
- `default` command can be used to provide a default value for a parameter. 
This value is used if the parameter has no value defined by any other means. 
(`default` has the  
+ `default` command can be used to provide a default value for a parameter. 
This value is used if the p

[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-04-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  == Motivation ==
  
- This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a template pig script and then use it with different parameters on a 
regular basis. For instance, if you have daily processing that is identical 
every day except the date it needs to process, it would be very convenient to 
put a placeholder for the date and provide the actual value at run time.
+ This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a  
+ template pig script and then use it with different parameters on a regular 
basis. For instance, if you have daily processing that is identical every day 
except the date it  
+ needs to process, it would be very convenient to put a placeholder for the 
date and provide the actual value at run time.
  
  == Requirements ==
  
@@ -17, +19 @@

  
  == Interface ==
  
- === Parameter Specification ===
+ === Using Parameters ===
  
- Parameters in a pig script will be of the form `$`. 
+ Parameters in a pig script are in the form of `$`. 
  
  {{{
  A = load '/data/mydata/$date';
@@ -27, +29 @@

  .
  }}}
  
- For this example, pig would expect `date` to be passed from pig command line 
or from a parameter file. The value would be substituted prior to running the 
load statement.
+ In this example, the value of the `date` parameter is expected to be passed 
on each invocation of the script and is substituted in before running the pig 
script. An error  
+ is generated if the value for any parameter is not found.
  
- In addition to supplying parameter value, a user can supply a command to 
execute to generate a parameter value. This can be done using `declare` 
statement. 
+ A parameter name have a structure of a standard language identifier: it must 
start with a latter or underscore followed by any number of letters, digits, 
and underscores. The  
+ names are case insensitive. The names can be escaped with `\` in which case 
substitution does not take place.
+ 
+ In the initial version of the software the parameters are only allowed when 
pig script is specified. They are disabled with `-e` switch or in the 
interactive mode. 
+ 
+ === Specifying Parameters ===
+ 
+ Parameter value can be supplied in four different ways.
+ 
+  Command Line 
+ 
+ Parameters can be passed via pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple  
+ times, the last value will be used and a warning will be generated.
+ 
+ The command line for Example 4 above would look as follows:
+ 
+ {{{
+ pig -param date='20080201'
+ }}}
+ 
+  Parameter File 
+ 
+ Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present  
+ multiple times in the file, the last value will be used and a warning will be 
generated. If a parameter present in multiple files, the value from the last 
file will be used  
+ and a warning will be generated.
+ 
+ A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be  
+ the first character on the line. Each parameter line will be of the form: 
`=`. White spaces around `=` are allowed but are 
optional. 
+ 
+ {{{
+ # my parameters
+ 
+ date = '20080201'
+ cmd = `generate_name`
+ }}}
+ 
+ Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
+ 
+  Declare Statement 
+ 
+ `declare` command can be used from within pig script. The use case for this 
is to describe one parameter in terms of other(s).
+ 
+ {{{
+ %declare CMD `$mycmd $date`
+ A = load '/data/mydata/$CMD';
+ B = filter A by $0>'5';
+ .
+ }}}
+ 
+ The format is `%declare  `
+ 
+ `declare` command starts with `%` to indicate that this is a preprocessor 
command that is processed prior to executing pig script. It takes the highest 
precedence. The  
+ scope of parameter value defined via `declare` is all the lines following 
`declare` command until the next `declare` command that defines this parameter 
is encountered.
+ 
+  Default Statement 
+ 
+ `default` command can be used to provide a default value for a parameter. 
This value is used if the parameter has no value defined by any other means. 
(`default` has the  
+ lowest priority.).
+ 
+ `default` has the format and scoping rules identica

[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-03-14 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  A -dryrun option will be added to pig in which case no execution is performed 
and substituted script is produced. We can also use the same option to produce 
just the execution plan.
  
+ === Logging === 
+ 
+ Pig uses Apache Commons Logging (http://commons.apache.org/logging/) in 
conjunction with log4j (http://logging.apache.org/log4j/), and we should do the 
same in the parameter substitution code.
+ 
+ The following code can be used to instantiate a logger:
+ 
+ {{{
+ import org.apache.commons.logging.Log;
+ import org.apache.commons.logging.LogFactory;
+ 
+ 
+ class ParameterSubstitutionPreprocessor
+ {
+ private final Log log = LogFactory.getLog(getClass());
+ 
+ }
+ }}}
+ 
+ Note that this code will work once we integrate this into Pig.
+ 
+ Pig uses INFO as the default log level. Any messages that you want users to 
see during normal operation should be logged at this level. Anything that is 
only useful for debugging should be logged at DEBUG level. Warnings should be 
logged at WARN level.
+ 
+ === Error Handling ===
+ 
+ All the errors should be propagated via exceptions. (The code should not use 
exit calls to make sure that the caller can react to the error.)
+ 
+ The following exceptions should be used:
+ 
+  * ParseExceptions - for any errors due to parsing command line or config 
file parameters or pig script.
+  * If the underlying code throws an exception and the exception is derived 
from RuntimeException - just let it propagate
+  * If the underlying code throws an exception that is not derived from 
RuntimeException, catch it and throw a RuntimeException with the original 
exception as the cause. (We want to make sure that we don't have to declare 
additional exceptions in our APIs.)
+  * Any exception that the code originates should be either RuntimeException 
or its derivation if appropriate.
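
The wrapping rule in the list above might look like the following in Java. This is a minimal sketch, assuming a hypothetical helper named `callUnchecked`; nothing here is an existing Pig API. Exceptions derived from `RuntimeException` pass through untouched, and any other exception is rethrown as a `RuntimeException` with the original as its cause.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper illustrating the exception policy; not part of Pig.
final class ErrorHandlingSketch {
    static <T> T callUnchecked(Callable<T> body) {
        try {
            return body.call();
        } catch (RuntimeException e) {
            // Already unchecked: just let it propagate.
            throw e;
        } catch (Exception e) {
            // Checked: wrap it so callers need no extra throws clauses,
            // keeping the original exception as the cause.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        try {
            callUnchecked(() -> {
                throw new IOException("cannot read parameter file");
            });
        } catch (RuntimeException e) {
            System.out.println("cause: " + e.getCause());
        }
    }
}
```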
+ 
  == Design ==
  
  A C-style preprocessor will be written to perform parameter substitution. The 
preprocessor will do the following:


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-25 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  For this example, pig would execute `generate_date` command when it 
encounters the `declare` statement and assigns the result (stdout) to parameter 
`CMD`. The value of `CMD` is substituted prior to running the load statement.
  
- `declare` statement starts with `%` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. 
+ `declare` statement starts with `%` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. The declare statement runs till the end of the line unless the value is 
a literal in which case it can take multiple lines.
  
  `declare` can also be used to define one parameter in terms of others:
  
@@ -106, +106 @@

  
  Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
  
- `declare` command takes the highest precedence. Having multiple `declare` 
commands defining the same parameter is an error that results in an error 
message and abort of the processing.
+ `declare` command takes the highest precedence. The scope of parameter value 
defined via `declare` is all the lines following `declare` command until the 
next `declare` command that defines this parameter is encountered.
  
- Default parameter values can be specified in a script using `%default  
` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before.
+ Default parameter values can be specified in a script using `%default  
` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before. Only first `default` statement for a particular parameter is 
meaningful. The rest are warned on and are ignored.
  
  {{{
  %default cmd=generate_name


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-25 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  === Parameter Specification ===
  
- Parameters in a pig script will be of the form `%`. 
+ Parameters in a pig script will be of the form `$`. 
  
  {{{
  A = load '/data/mydata/$date';


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-25 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  === Parameter Specification ===
  
- Parameters in a pig script will be of the form `$`. 
+ Parameters in a pig script will be of the form `%`. 
  
  {{{
  A = load '/data/mydata/$date';


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-11 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  A C-style preprocessor will be written to perform parameter substitution. The 
preprocessor will do the following:
  
   1. Create an empty `.substituted` file in the current working 
directory
-  2. Read parameters from files, command line and populate parameter hash 
using precedence rules describe above.
+  2. Create `parameter hash` that maps parameter names to parameter values.
+  3. Read parameters from files in the order they are specified on the command 
line
+  4. `Resolve each parameter`:
+   * search the parameter value for variables that need to be replaced and 
perform replacement if needed. Generate an error and abort if replacement is 
needed but the correspondent parameter is not found in the parameter hash.
+   * if the parameter value is enclosed in backticks, run the command and 
capture its stdout. If the command succeeds (returns 0), store the parameter in 
the hash with the value equal to stdout of the command. If the command fails 
(returns non-0 value), report the error and abort the processing.
+   * if the value is not a command, store it in the parameter hash.
+   * if this is a duplicate parameter, warn and replace the old value with 
newly generated one.
+  5. Resolve each command line parameter in the order they are specified on 
the command line
+   * use the same resolution steps as for parameters passed in a file
-  3. For each line in the input script
+  6. For each line in the input script
    * if comment or empty line, copy over
+   * if declare line, resolve the parameter using the same steps as for 
parameters passed in a file
-   * if declare line
-* search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash.
-* if the param value is enclosed in backticks, run the command and capture 
its stdout. If the command succeeds, store the parameter defined in `declare` 
in the parameter hash with its value set to command's stdout. If the command 
fails, report the error and abort the processing.
-* if declare statement is not a command, store it in the parameter hash.
-   * default line is encountered, the parameter defined is looked up in the 
parameter hash. If the parameter is not found, processing identical to declare 
line is performed; otherwise, the line is skipped.
+   * if default line is encountered, the parameter defined is looked up in the 
parameter hash. If the parameter is not found, processing identical to declare 
line is performed; otherwise, the line is skipped.
    * for all other lines
     * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash. (Reuse the code 
from the parameter substitution in declare statement.)
     * place the substituted line into the output file.
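
The "search the line and replace" step above can be sketched as a single Java method. This is an illustration only: the class `SubstitutionSketch`, the method `substituteLine`, and the regular expression are hypothetical, and two behaviours are assumptions rather than part of the proposal text here: names are looked up lower-cased (names are described elsewhere on the page as case insensitive), and an escaped `\$name` is emitted as `$name` with the backslash dropped. Command execution for backtick values is deliberately left out.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of per-line substitution; not Pig's implementation.
final class SubstitutionSketch {
    // Optional backslash, then '$', then an identifier (letter or underscore
    // first, followed by letters, digits, underscores). Positional references
    // such as $0 do not match, so they are left alone.
    private static final Pattern PARAM =
        Pattern.compile("(\\\\?)\\$([A-Za-z_][A-Za-z0-9_]*)");

    static String substituteLine(String line, Map<String, String> paramHash) {
        Matcher m = PARAM.matcher(line);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            if (!m.group(1).isEmpty()) {
                // Escaped reference: no substitution (assumption: drop the backslash).
                out.append('$').append(m.group(2));
            } else {
                String value = paramHash.get(m.group(2).toLowerCase(Locale.ROOT));
                if (value == null) {
                    // Missing parameter: report an error and abort.
                    throw new RuntimeException("Undefined parameter: " + m.group(2));
                }
                out.append(value);
            }
            last = m.end();
        }
        out.append(line, last, line.length());
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> paramHash = new HashMap<>();
        paramHash.put("date", "20080201");
        System.out.println(substituteLine("A = load '/data/mydata/$date';", paramHash));
    }
}
```

Keeping the lookup in one method matches the note in the list that declare lines and ordinary lines should reuse the same substitution code.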


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-11 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  In addition to supplying parameter value, a user can supply a command to 
execute to generate a parameter value. This can be done using `declare` 
statement. 
  
  {{{
- #declare CMD `generate_date`
+ %declare CMD `generate_date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -40, +40 @@

  
  For this example, pig would execute `generate_date` command when it 
encounters the `declare` statement and assigns the result (stdout) to parameter 
`CMD`. The value of `CMD` is substituted prior to running the load statement.
  
- `declare` statement starts with `#` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. 
+ `declare` statement starts with `%` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. 
  
  `declare` can also be used to define one parameter in terms of others:
  
  {{{
- #declare param1 ($param2 + $param3)
+ %declare param1 ($param2 + $param3)
  }}}
  
  With exception to string literals that can span multiple lines, for initial 
release, `declare` is a single-line command.
@@ -53, +53 @@

  The command specified within `declare` statement can take parameters which 
need to be substituted as well.
  
  {{{
- #declare CMD `generate_date $date`
+ %declare CMD `generate_date $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -64, +64 @@

  Note that variables passed on the command line must be resolved prior to the 
declare statement. The following sequence would cause an error:
  
  {{{
- #declare A `cmd1 $B`
+ %declare A `cmd1 $B`
- #declare $B `cmd2`
+ %declare $B `cmd2`
  }}}
  
  Command name itself can be a parameter.
  
  {{{
- #declare CMD `$mycmd $date`
+ %declare CMD `$mycmd $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -108, +108 @@

  
  `declare` command takes the highest precedence. Having multiple `declare` 
commands defining the same parameter is an error that results in an error 
message and abort of the processing.
  
- Default parameter values can be specified in a script using `#default  
` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before.
+ Default parameter values can be specified in a script using `%default  
` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before.
  
  {{{
- #default cmd=generate_name
+ %default cmd=generate_name
  }}}
+ 
+ Values specified from the command line as well as configuration file can be 
commands or expressions including other parameters. Their format is identical 
to `declare` and `default` format. Also, the same rule that variables need to 
be resolved before they can be used applies. The following order will be used:
+ 
+  1. Configuration files will be scanned in the order they are specified on 
the command line. Within each file, the parameters are processed in the order 
they are specified.
+  2. Command line parameters will be scanned in the order they are specified 
on the command line.
+  3. declare/default commands will be processed in the order they appear in 
the pig script.
  
  === Debugging ===
  


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  .
  }}}
  
- In this example, parameters `cmd` and `date` are substituted first when 
`declare` statement is encountered. Then the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `mycmd` and `date` are substituted first when 
`declare` statement is encountered. Then the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
  
  Note that parameter names are case insensitive and $cmd and $CMD means the 
same thing. This is to match the rest of Pig Latin.
  


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  = Parameter Substitution in Pig =
  
- ==  Motivation ==
+ == Motivation ==
  
  This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a template pig script and then use it with different parameters on a 
regular basis. For instance, if you have daily processing that is identical 
every day except the date it needs to process, it would be very convenient to 
put a placeholder for the date and provide the actual value at run time.
  
@@ -29, +29 @@

  
  For this example, pig would expect `date` to be passed from pig command line 
or from a parameter file. The value would be substituted prior to running the 
load statement.
  
- In addition to supplying parameter value, a user can supply a command to 
execute to generate a parameter value. This can be done using `declare` 
statement.
+ In addition to supplying parameter value, a user can supply a command to 
execute to generate a parameter value. This can be done using `declare` 
statement. 
  
  {{{
- declare CMD `generate_date`
+ #declare CMD `generate_date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -40, +40 @@

  
  For this example, pig would execute `generate_date` command when it 
encounters the `declare` statement and assigns the result (stdout) to parameter 
`CMD`. The value of `CMD` is substituted prior to running the load statement.
  
- A command can take parameters which need to be substituted as well.
+ `declare` statement starts with `#` to indicate that it is part of the 
preprocessor that performs parameter substitution rather than Pig language 
itself. 
+ 
+ `declare` can also be used to define one parameter in terms of others:
  
  {{{
+ #declare param1 ($param2 + $param3)
+ }}}
+ 
+ With exception to string literals that can span multiple lines, for initial 
release, `declare` is a single-line command.
+ 
+ The command specified within `declare` statement can take parameters which 
need to be substituted as well.
+ 
+ {{{
- declare CMD `generate_date $date`
+ #declare CMD `generate_date $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -54, +64 @@

  Note that variables passed on the command line must be resolved prior to the 
declare statement. The following sequence would cause an error:
  
  {{{
- declare A `cmd1 $B`
+ #declare A `cmd1 $B`
- declare $B `cmd2`
+ #declare $B `cmd2`
  }}}
  
  Command name itself can be a parameter.
  
  {{{
- declare CMD `$mycmd $date`
+ #declare CMD `$mycmd $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -96, +106 @@

  
  Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
  
- The fault parameter values can be specified in a script using `declare 
=` statement:
+ `declare` command takes the highest precedence. Having multiple `declare` 
commands defining the same parameter is an error that results in an error 
message and abort of the processing.
+ 
+ Default parameter values can be specified in a script using `#default  
` statement. This statement is identical to `declare` except that it has 
the lowest precedence meaning that its value is only used if it has not been 
defined before.
  
  {{{
- declare cmd=generate_name
+ #default cmd=generate_name
  }}}
- 
- Default values are only used if parameters is not specified.
- 
- `declare` can also be used to define one parameter in terms of others:
- 
- {{{
- declare param1 ($param2 + $param3)
- }}}
- 
- Note that `param2` and `param3` must be defined prior to this `declare` 
statement.
  
  === Debugging ===
  
@@ -122, +124 @@

  
  A C-style preprocessor will be written to perform parameter substitution. The 
preprocessor will do the following:
  
-  1. Create  an empty `.substituted` file in the current 
working directory
+  1. Create an empty `.substituted` file in the current working 
directory
   2. Read parameters from files, command line and populate parameter hash 
using precedence rules describe above.
   3. For each line in the input script
    * if comment or empty line, copy over
@@ -130, +132 @@

 * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash.
 * if the param value is enclosed in backticks, run the command and capture 
its stdout. If the command succeeds, store the parameter defined in `declare` 
in the parameter hash with its value set to command's stdout. If the command 
fails, report the error and ab

[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  pig -param date='20080201' -param cmd='generate_name'
  }}}
  
- Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present multiple times in the file, the last value will be used. 
If a parameter present in multiple files, the value from the last file will be 
used and a warning will be generated.
+ Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present multiple times in the file, the last value will be used 
and a warning will be generated. If a parameter present in multiple files, the 
value from the last file will be used and a warning will be generated.
  
- A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comment must take the 
full line and `#` must be the first character on the line. Each parameter line 
would be of the form: `=. White spaces around `=` are 
allowed but are optional. Param value can include white spaces. There is no 
need to quote the value and the quotes will be considered to be part of the 
value.
+ A parameter file will contain one line per parameter. Empty lines are 
allowed. Perl style (#) comment lines are also allowed. Comments must take a 
full line and `#` must be the first character on the line. Each parameter line 
will be of the form: `=`. White spaces around `=` are 
allowed but are optional. A parameter value can include white spaces. There is 
no need to quote the value and the quotes will be considered part of the value.
  
  The parameter file for Example 4 above would look as follows:
  
@@ -96, +96 @@

  
  Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
  
- The fault parameter values can be specified in a script using `declare 
=` command:
+ Default parameter values can be specified in a script using the `declare 
=` statement:
  
  {{{
  declare cmd=generate_name
@@ -120, +120 @@

  
  == Design ==
  
- A C-style preprocessor will be writtem to perform parameter substitution. The 
preprocessor will do the following:
+ A C-style preprocessor will be written to perform parameter substitution. The 
preprocessor will do the following:
  
   1. Create  an empty `.substituted` file in the current 
working directory
   2. Read parameters from files and the command line, and populate the 
parameter hash using the precedence rules described above.


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  For this example, parameter `date` is substituted first when `declare` 
statement is encountered. Then `generate_name` command is executed passing 
value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` 
which is used in the load statement prior to its execution.
  
- Note that the variables passed on the command line must be resolved prior to 
the declare statement. The following sequence would cause an error:
+ Note that variables passed on the command line must be resolved prior to the 
declare statement. The following sequence would cause an error:
  
  {{{
  declare A `cmd1 $B`
@@ -67, +67 @@

  .
  }}}
  
- In this example, parameters `cmd` and `date` are substituted first when 
`declare` statement is encountered. The the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `cmd` and `date` are substituted first when 
`declare` statement is encountered. Then the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
  
- Note that parameter names are case insesitive and $cmd and $CMD means the 
same thing. This is to match the rest of Pig Latin.
+ Note that parameter names are case insensitive, so $cmd and $CMD mean the 
same thing. This is to match the rest of Pig Latin.
  
  === Parameter Passing ===
  
@@ -94, +94 @@

  cmd = generate_name
  }}}
  
- Files and command line parameters can be combined with command line 
parameters taking precedence over files in case of duplicate parameters.
+ Files and command line parameters can be combined, with command line 
parameters taking precedence over files in case of duplicate parameters.
  
  The fault parameter values can be specified in a script using `declare 
=` command:
  
@@ -103, +103 @@

  }}}
  
  Default values are only used if the parameter is not specified.
+ 
+ `declare` can also be used to define one parameter in terms of others:
+ 
+ {{{
+ declare param1 ($param2 + $param3)
+ }}}
+ 
+ Note that `param2` and `param3` must be defined prior to this `declare` 
statement.
  
  === Debugging ===
  
@@ -115, +123 @@

  A C-style preprocessor will be writtem to perform parameter substitution. The 
preprocessor will do the following:
  
   1. Create  an empty `.substituted` file in the current 
working directory
-  2. Read parameters from files, command line , and declare statement and 
construct a hash preserving the precedence rules in case of duplicates 
described above
+  2. Read parameters from files and the command line, and populate the 
parameter hash using the precedence rules described above.
   3. For each line in the input script
-   * if declare line, skip
    * if comment or empty line, copy over
+   * if declare line
+    * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash.
+    * if the param value is enclosed in backticks, run the command and capture 
its stdout. If the command succeeds, store the parameter defined in `declare` 
in the parameter hash with its value set to command's stdout. If the command 
fails, report the error and abort the processing.
+    * if declare statement is not a command, store it in the parameter hash.
    * for all other lines
+    * search the line for variables that need to be replaced and perform 
replacement if needed. Generate an error and abort if replacement is needed but 
the correspondent parameter is not found in the parameter hash. (Reuse the code 
from the parameter substitution in declare statement.)
-    * parse each part enclosed in `%` and remove `%s`. Locate any identifier 
that starts with `$` and lookup it up in the hash. If found, make the 
substitution; otherwise, report an error and abort.
-    * if the part is enclosed in backticks, run the substitued command. If the 
command succeeds, substitute the command with its stdout. If it fails, report 
an error and abort.
     * place the substituted line into the output file.
   4. If -dryrun is not specified, pass the output file to grunt to execute. 
Otherwise, print the name of the file and exit.
   5. if neither -debug nor -dryrun are specified, remove the output file.
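
The per-line processing in step 3 of the revised design can be sketched in Python. This is only an illustration under stated assumptions, not the Pig implementation: the names `preprocess`, `substitute` and `run_command` are hypothetical, `--` is assumed to start a comment line, `declare` lines are assumed not to be copied to the output, only identifiers that start with a letter or underscore are treated as parameters (so that positional references such as `$0` are left alone), and the default-value rule is not modelled.

```python
import re
import subprocess

# Assumption: a parameter is "$" followed by an identifier, so "$0" is ignored.
PARAM = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def run_command(command):
    """Run a shell command and return its stdout; raise if it fails."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"command failed: {command!r}")
    return result.stdout.strip()


def substitute(line, params):
    """Replace every $name in line; an unknown name aborts with KeyError."""
    def lookup(match):
        name = match.group(1).lower()  # parameter names are case insensitive
        if name not in params:
            raise KeyError(f"undefined parameter: {match.group(1)}")
        return params[name]
    return PARAM.sub(lookup, line)


def preprocess(lines, params, run=run_command):
    """Return the substituted script lines for the given parameter hash."""
    params = {name.lower(): value for name, value in params.items()}
    output = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            output.append(line)  # comment or empty line: copy over
        elif stripped.split(None, 1)[0] == "declare":
            parts = substitute(stripped, params).split(None, 2)
            if len(parts) != 3:
                raise ValueError(f"malformed declare: {line!r}")
            _, name, value = parts
            if len(value) >= 2 and value.startswith("`") and value.endswith("`"):
                value = run(value[1:-1])  # backticks: run and capture stdout
            params[name.lower()] = value
        else:
            output.append(substitute(line, params))
    return output
```

Passing the command runner as an argument keeps the sketch testable without executing real binaries; a caller that wants the behaviour described above would leave `run` at its default.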


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  == Requirements ==
  
   1. Ability to have parameters within a pig script and provide values for 
these parameters at run time.
-  2. Ability to provide parameter values on the command file
+  2. Ability to provide parameter values on the command line
   3. Ability to provide parameter values in a file
-  4. Ability to generate parameter value at run time by running a binary or 
script.
+  4. Ability to generate parameter values at run time by running a binary or a 
script.
   5. Ability to provide default values for parameters
   6. Ability to retain the script with all parameters resolved. This is mostly 
for debugging purposes.
  
@@ -19, +19 @@

  
  === Parameter Specification ===
  
- Parameters in pig script will be of the for `$`. 
+ Parameters in a pig script will be of the form `$`. 
  
  {{{
- A = load '/data/mydata/%$date%';
+ A = load '/data/mydata/$date';
  B = filter A by $0>'5';
  .
  }}}
  
- In this case, pig would expect the program to pass a value for parameter 
named `date` and will place it in the path before executing the load statement.
+ For this example, pig would expect `date` to be passed from pig command line 
or from a parameter file. The value would be substituted prior to running the 
load statement.
  
- In addition to parameters, a user can supply a command to execute to get the 
substitution value. This can be done using `declare` command.
+ In addition to supplying a parameter value directly, a user can supply a 
command to execute to generate a parameter value. This can be done using the 
`declare` statement.
  
  {{{
  declare CMD `generate_date`
@@ -38, +38 @@

  .
  }}}
  
- In this case, pig would execute the program called `generate_date` when it 
encounters `declare` statement and place its stdout into the path before 
executing the load statement.
+ For this example, pig would execute the `generate_date` command when it 
encounters the `declare` statement and assign the result (stdout) to parameter 
`CMD`. The value of `CMD` is substituted prior to running the load statement.
  
  A command can take parameters which need to be substituted as well.
  
@@ -49, +49 @@

  .
  }}}
  
- In this case, parameter `date` is substituted first, then `generate_name` 
command ran passing value of `date` as command line parameter and the stdout of 
the command is placed into the path prior to executing the load statement.
+ For this example, parameter `date` is substituted first when `declare` 
statement is encountered. Then `generate_name` command is executed passing 
value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` 
which is used in the load statement prior to its execution.
  
- Note that the variables passed on the command line must be resolved prior to 
the declare statement. This is wrong:
+ Note that the variables passed on the command line must be resolved prior to 
the declare statement. The following sequence would cause an error:
  
  {{{
  declare A `cmd1 $B`
@@ -67, +67 @@

  .
  }}}
  
- In this case, parameters `cmd` and `date` are substituted first. The the 
resulting command is executed and its stdout is placed into the path prior to 
running the load statement.
+ In this example, parameters `cmd` and `date` are substituted first when 
`declare` statement is encountered. The the resulting command is executed and 
its stdout is placed into the path prior to running the load statement.
  
  Note that parameter names are case insesitive and $cmd and $CMD means the 
same thing. This is to match the rest of Pig Latin.
  


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-06 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  
  === Parameter Specification ===
  
+ Parameters in pig script will be of the for `$`. 
- Parameters in the script will be enclosed in `%`. The content within `%` can 
be either a parameter name in which case it starts with `$` or a command to run 
to generate the value in which case it will be enclosed in backticks.
- 
- '''Example 1''': parameter name specification
  
  {{{
  A = load '/data/mydata/%$date%';
@@ -31, +29 @@

  
  In this case, pig would expect the program to pass a value for parameter 
named `date` and will place it in the path before executing the load statement.
  
- '''Example 2''': command specification
+ In addition to parameters, a user can supply a command to execute to get the 
substitution value. This can be done using `declare` command.
  
  {{{
- A = load '/data/mydata/%`generate_date`%';
+ declare CMD `generate_date`
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
  
- In this case, pig would execute the program called `generate_date` and place 
its stdout into the path before executing the load statement.
+ In this case, pig would execute the program called `generate_date` when it 
encounters `declare` statement and place its stdout into the path before 
executing the load statement.
  
- '''Example 3''': command taking a parameter
+ A command can take parameters which need to be substituted as well.
  
  {{{
- A = load '/data/mydata/%`generate_name $date`%';
+ declare CMD `generate_name $date`
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
  
  In this case, parameter `date` is substituted first, then `generate_name` 
command ran passing value of `date` as command line parameter and the stdout of 
the command is placed into the path prior to executing the load statement.
  
- '''Example 4''': command is passed as a parameter:
+ Note that the variables passed on the command line must be resolved prior to 
the declare statement. This is wrong:
  
  {{{
+ declare A `cmd1 $B`
+ declare $B `cmd2`
+ }}}
+ 
+ Command name itself can be a parameter.
+ 
+ {{{
+ declare CMD `$mycmd $date`
- A = load '/data/mydata/%`$cmd $date`%';
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
  
  In this case, parameters `cmd` and `date` are substituted first. The the 
resulting command is executed and its stdout is placed into the path prior to 
running the load statement.
+ 
+ Note that parameter names are case insesitive and $cmd and $CMD means the 
same thing. This is to match the rest of Pig Latin.
  
  === Parameter Passing ===
  


[Pig Wiki] Update of "ParameterSubstitution" by OlgaN

2008-02-04 Thread Apache Wiki

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

New page:
= Parameter Substitution in Pig =

==  Motivation ==

This document describes a proposal for implementing parameter substitution in 
pig. This proposal is motivated by multiple requests from users who would like 
to create a template pig script and then use it with different parameters on a 
regular basis. For instance, if you have daily processing that is identical 
every day except the data it needs to process, it would be very convenient to 
put a placeholder for the date and provide the actual value at run time.

== Requirements ==

1. Ability to have parameters within a pig script and provide values for these 
parameters at run time.
2. Ability to provide parameter values on the command file
3. Ability to provide parameter values in a file
4. Ability to generate parameter value at run time by running a binary or 
script.
5. Ability to provide default values for parameters
6. Ability to retain the script with all parameters resolved. This is mostly 
for debugging purposes.

== Interface ==

=== Parameter Specification ===

Parameters in the script will be enclosed in `%`. The content within `%` can be 
either a parameter name in which case it starts with `$` or a command to run to 
generate the value in which case it will be enclosed in backticks.

'''Example 1''': parameter name specification

{{{
A = load '/data/mydata/%$date%';
B = filter A by $0>'5';
.
}}}

In this case, pig would expect the program to pass a value for parameter named 
`date` and will place it in the path before executing the load statement.

'''Example 2''': command specification

{{{
A = load '/data/mydata/%`generate_date`%';
B = filter A by $0>'5';
.
}}}

In this case, pig would execute the program called `generate_date` and place 
its stdout into the path before executing the load statement.

'''Example 3''': command taking a parameter

{{{
A = load '/data/mydata/%`generate_name $date`%';
B = filter A by $0>'5';
.
}}}

In this case, parameter `date` is substituted first, then `generate_name` 
command ran passing value of `date` as command line parameter and the stdout of 
the command is placed into the path prior to executing the load statement.

'''Example 4''': command is passed as a parameter:

{{{
A = load '/data/mydata/%`$cmd $date`%';
B = filter A by $0>'5';
.
}}}

In this case, parameters `cmd` and `date` are substituted first. The the 
resulting command is executed and its stdout is placed into the path prior to 
running the load statement.
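
The `%`-delimited scheme in Examples 1 to 4 can be sketched in Python as follows. This is a hedged illustration of the rules as stated, not an actual implementation: the names `expand` and `default_run` are hypothetical, and the sketch assumes a section never contains a literal `%`.

```python
import re
import subprocess

SECTION = re.compile(r"%([^%]*)%")  # text enclosed in a pair of % signs
NAME = re.compile(r"\$(\w+)")       # $name inside such a section


def default_run(command):
    """Run a shell command and return its stdout; raise if it fails."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"command failed: {command!r}")
    return result.stdout.strip()


def expand(line, params, run=default_run):
    """Expand every %...% section of line using params and, if needed, run."""
    def fill(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined parameter: {name}")
        return params[name]

    def section(match):
        # Parameters are substituted first, then a backticked body is executed.
        body = NAME.sub(fill, match.group(1))
        if len(body) >= 2 and body.startswith("`") and body.endswith("`"):
            return run(body[1:-1])
        return body

    return SECTION.sub(section, line)
```

Because only text between `%` signs is examined, a positional reference such as `$0` in the filter statement is left untouched.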

=== Parameter Passing ===

Parameters can be specified on pig command line using `-param =` 
construct. Multiple parameters can be specified. If the same parameter is 
specified multiple times, the last value will be used and a warning will be 
generated.

The command line for Example 4 above would look as follows:

{{{
pig -param date='20080201' -param cmd='generate_name'
}}}
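
The rule for repeated `-param` options (last value wins, with a warning) can be sketched in Python. This is a minimal illustration, not Pig's option parser; the function name `collect_params` and the warning text are assumptions, and the values are taken as they arrive after the shell has removed any quotes.

```python
import sys


def collect_params(pairs, warn=lambda msg: print(msg, file=sys.stderr)):
    """Build a parameter dict from 'name=value' strings; the last value wins."""
    params = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected name=value, got {pair!r}")
        if name in params:
            warn(f"parameter {name!r} specified more than once; using the last value")
        params[name] = value
    return params
```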

Parameters can also be specified in a file that can be passed to pig using 
`-param_file ` construct. Multiple files can be specified. If the same 
parameter is present multiple times in the file, the last value will be used. 
If a parameter is present in multiple files, the value from the last file will 
be used and a warning will be generated.

A parameter file will contain one line per parameter. Empty lines are allowed. 
Perl style (#) comment lines are also allowed. Comment must take the full line 
and `#` must be the first character on the line. Each parameter line would be 
of the form: `=. White spaces around `=` are allowed 
but are optional. Param value can include white spaces. There is no need to 
quote the value and the quotes will be considered to be part of the value.

The parameter file for Example 4 above would look as follows:

{{{
# my parameters

date = 20080201
cmd = generate_name
}}}
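
A reader for this file format could look roughly like the Python sketch below. It follows the rules stated above (empty lines and full-line `#` comments are skipped, white space around `=` is optional, quotes stay part of the value, the last duplicate wins); the function name `parse_param_file` is hypothetical.

```python
def parse_param_file(text):
    """Parse parameter-file text into a dict of name -> value."""
    params = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Empty lines are allowed; '#' must be the first character of a comment.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"line {lineno}: expected name = value")
        # The value may contain white space; quotes are kept as part of it.
        params[name] = value.strip()
    return params
```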

Files and command line parameters can be combined with command line parameters 
taking precedence over files in case of duplicate parameters.
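
The precedence between files and the command line can be illustrated with a short Python sketch. The function name `merge_params` and the warning text are assumptions; the inputs are plain dicts, one per parameter file in the order given, plus one for the command line.

```python
import sys


def merge_params(file_param_sets, cli_params,
                 warn=lambda msg: print(msg, file=sys.stderr)):
    """Merge per-file parameter dicts, then let the command line override."""
    merged = {}
    for file_params in file_param_sets:
        for name, value in file_params.items():
            if name in merged:
                warn(f"parameter {name!r} defined in more than one file; using the last")
            merged[name] = value
    # Command line parameters take precedence over anything read from files.
    merged.update(cli_params)
    return merged
```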

Default parameter values can be specified in a script using the `declare 
=` command:

{{{
declare cmd=generate_name
}}}

Default values are only used if the parameter is not specified.
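
As a small Python sketch of the default rule: a value from `declare` is stored only when the parameter has not already been supplied. The helper name `apply_default` is hypothetical and the parsing is deliberately simplistic.

```python
def apply_default(params, declare_line):
    """Return a copy of params with the declared default filled in if missing."""
    keyword, _, rest = declare_line.strip().partition(" ")
    name, sep, value = rest.partition("=")
    if keyword != "declare" or not sep or not name.strip():
        raise ValueError(f"malformed declare: {declare_line!r}")
    result = dict(params)
    # setdefault keeps any value that was already supplied elsewhere.
    result.setdefault(name.strip(), value.strip())
    return result
```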

=== Debugging ===

If the -debug option is specified, pig will produce a fully substituted pig 
script in the current working directory named `.substituted`.

A -dryrun option will be added to pig, in which case no execution is performed 
and the substituted script is produced. We can also use the same option to 
produce just the execution plan.

== Design ==

A C-style preprocessor will be writtem to perform parameter substitution. The 
preprocessor will do the following:

 1. Create  an empty `.substituted` file in the current working 
directory
 2. Read parameters from files, command line , and declare statement and 
construct a hash preserving the precedence rules in