[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  Parameters can be passed via pig command line using `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.

  {{{
- pig -param date=\'20080201\'
+ pig -param date=20080201
  }}}

  Parameter File
@@ -56, +56 @@

  {{{
  # my parameters

- date = '20080201'
+ date = 20080201
  cmd = `generate_name`
  }}}

@@ -95, +95 @@

  Value Format

- Value formats are identical regardless of how the parameter is specified and can be of two types. First is a sequence of characters enclosed in single or double quotes. In this case the unquoted version of the value is used during substitution. Quotes within the value can be escaped.
+ Value formats are identical regardless of how the parameter is specified and can be of two types. First is a sequence of characters enclosed in single or double quotes. In this case the unquoted version of the value is used during substitution. Quotes within the value can be escaped. Single word values that don't use special characters such as `%` or `=` don't have to be quoted.

  {{{
  %declare DESC 'Joe\'s URL'
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  Parameters can be passed via pig command line using `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.

  {{{
- pig -param date='20080201'
+ pig -param date=\'20080201\'
  }}}

  Parameter File
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution -- . }}} + In this example, the value of the `date` is expected to be passed on each invocation of the script and is substituted before running the pig script. An error is generated if the value for any parameter is not found. - In this example, the value of the `date` parameter is expected to be passed on each invocation of the script and is substituted in before running the pig script. An error - is generated if the value for any parameter is not found. - A parameter name have a structure of a standard language identifier: it must start with a latter or underscore followed by any number of letters, digits, and underscores. The + A parameter name have a structure of a standard language identifier: it must start with a letter or underscore followed by any number of letters, digits, and underscores. The names are case insensitive. The names can be escaped with `\` in which case substitution does not take place. - names are case insensitive. The names can be escaped with `\` in which case substitution does not take place. In the initial version of the software the parameters are only allowed when pig script is specified. They are disabled with `-e` switch or in the interactive mode. @@ -43, +41 @@ Command Line - Parameters can be passed via pig command line using `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple + Parameters can be passed via pig command line using `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated. - times, the last value will be used and a warning will be generated. 
- - The command line for Example 4 above would look as follows: {{{ pig -param date='20080201' @@ -54, +49 @@ Parameter File + Parameters can also be specified in a file that can be passed to pig using `-param_file ` construct. Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used and a warning will be generated. If a parameter present in multiple files, the value from the last file will be used and a warning will be generated. - Parameters can also be specified in a file that can be passed to pig using `-param_file ` construct. Multiple files can be specified. If the same parameter is present - multiple times in the file, the last value will be used and a warning will be generated. If a parameter present in multiple files, the value from the last file will be used - and a warning will be generated. + A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be the first character on the line. Each parameter line will be of the form: `=`. White spaces around `=` are allowed but are optional. - A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be - the first character on the line. Each parameter line will be of the form: `=`. White spaces around `=` are allowed but are optional. {{{ # my parameters @@ -68, +60 @@ cmd = `generate_name` }}} - Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters. + Files and command line parameters can be combined with command line parameters taking precedence. Declare Statement @@ -83, +75 @@ The format is `%declare ` + `declare` command starts with `%` to indicate that this is a preprocessor command that is processed prior to executing pig script. 
It takes the highest precedence. The scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines the same parameter is encountered. - `declare` command starts with `%` to indicate that this is a preprocessor command that is processed prior to executing pig script. It takes the highest precedence. The - scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines this parameter is encountered. Default Statement - `default` command can be used to provide a default value for a parameter. This value is used if the parameter has no value defined by any other means. (`default` has the + `default` command can be used to provide a default value for a parameter. This value is used if the p
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution -- == Motivation == - This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time. + This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a + template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it + needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time. == Requirements == @@ -17, +19 @@ == Interface == - === Parameter Specification === + === Using Parameters === - Parameters in a pig script will be of the form `$`. + Parameters in a pig script are in the form of `$`. {{{ A = load '/data/mydata/$date'; @@ -27, +29 @@ . }}} - For this example, pig would expect `date` to be passed from pig command line or from a parameter file. The value would be substituted prior to running the load statement. + In this example, the value of the `date` parameter is expected to be passed on each invocation of the script and is substituted in before running the pig script. An error + is generated if the value for any parameter is not found. - In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement. 
+ A parameter name have a structure of a standard language identifier: it must start with a latter or underscore followed by any number of letters, digits, and underscores. The + names are case insensitive. The names can be escaped with `\` in which case substitution does not take place. + + In the initial version of the software the parameters are only allowed when pig script is specified. They are disabled with `-e` switch or in the interactive mode. + + === Specifying Parameters === + + Parameter value can be supplied in four different ways. + + Command Line + + Parameters can be passed via pig command line using `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple + times, the last value will be used and a warning will be generated. + + The command line for Example 4 above would look as follows: + + {{{ + pig -param date='20080201' + }}} + + Parameter File + + Parameters can also be specified in a file that can be passed to pig using `-param_file ` construct. Multiple files can be specified. If the same parameter is present + multiple times in the file, the last value will be used and a warning will be generated. If a parameter present in multiple files, the value from the last file will be used + and a warning will be generated. + + A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be + the first character on the line. Each parameter line will be of the form: `=`. White spaces around `=` are allowed but are optional. + + {{{ + # my parameters + + date = '20080201' + cmd = `generate_name` + }}} + + Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters. + + Declare Statement + + `declare` command can be used from within pig script. The use case for this is to describe one parameter in terms of other(s). 
+ + {{{ + %declare CMD `$mycmd $date` + A = load '/data/mydata/$CMD'; + B = filter A by $0>'5'; + . + }}} + + The format is `%declare ` + + `declare` command starts with `%` to indicate that this is a preprocessor command that is processed prior to executing pig script. It takes the highest precedence. The + scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines this parameter is encountered. + + Default Statement + + `default` command can be used to provide a default value for a parameter. This value is used if the parameter has no value defined by any other means. (`default` has the + lowest priority.). + + `default` has the format and scoping rules identica
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  A -dryrun option will be added to pig in which case no execution is performed and substituted script is produced. We can also use the same option to produce just the execution plan.

+ === Logging ===
+
+ Pig uses Apache Commons Logging (http://commons.apache.org/logging/) in conjunction with log4j (http://logging.apache.org/log4j/), and we should do the same in the parameter substitution code.
+
+ The following code can be used to instantiate a logger:
+
+ {{{
+ import org.apache.commons.logging.Log;
+ import org.apache.commons.logging.LogFactory;
+
+
+ class ParameterSubstitutionPreprocessor
+ {
+     private final Log log = LogFactory.getLog(getClass());
+
+ }
+ }}}
+
+ Note that this code will work once we integrate this into Pig.
+
+ Pig uses INFO as the default log level. Any messages that you want users to see during normal operation should be logged at this level. Anything that is only useful for debugging should be logged at DEBUG level. Warnings should be logged at WARN level.
+
+ === Error Handling ===
+
+ All errors should be propagated via exceptions. (The code should not use exit calls, so that the caller can react to the error.)
+
+ The following exceptions should be used:
+
+  * ParseExceptions - for any errors due to parsing command line or config file parameters or pig script.
+  * If the underlying code throws an exception and the exception is derived from RuntimeException - just let it propagate
+  * If the underlying code throws an exception that is not derived from RuntimeException, catch it and throw a RuntimeException with the original exception as the cause. (We want to make sure that we don't have to declare additional exceptions in our APIs.)
+  * Any exception that the code originates should be either RuntimeException or its derivation if appropriate.
+
  == Design ==

  A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following:
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  For this example, pig would execute `generate_date` command when it encounters the `declare` statement and assigns the result (stdout) to parameter `CMD`. The value of `CMD` is substituted prior to running the load statement.

- `declare` statement starts with `%` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself.
+ `declare` statement starts with `%` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself. The declare statement runs till the end of the line unless the value is a literal, in which case it can take multiple lines.

  `declare` can also be used to define one parameter in terms of others:

@@ -106, +106 @@

  Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.

- `declare` command takes the highest precedence. Having multiple `declare` commands defining the same parameter is an error that results in an error message and abort of the processing.
+ `declare` command takes the highest precedence. The scope of parameter value defined via `declare` is all the lines following `declare` command until the next `declare` command that defines this parameter is encountered.

- Default parameter values can be specified in a script using `%default ` statement. This statement is identical to `declare` except that it has the lowest precedence meaning that its value is only used if it has not been defined before.
+ Default parameter values can be specified in a script using `%default ` statement. This statement is identical to `declare` except that it has the lowest precedence, meaning that its value is only used if it has not been defined before. Only the first `default` statement for a particular parameter is meaningful. The rest generate a warning and are ignored.

  {{{
  %default cmd=generate_name
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  === Parameter Specification ===

- Parameters in a pig script will be of the form `%`.
+ Parameters in a pig script will be of the form `$`.

  {{{
  A = load '/data/mydata/$date';
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  === Parameter Specification ===

- Parameters in a pig script will be of the form `$`.
+ Parameters in a pig script will be of the form `%`.

  {{{
  A = load '/data/mydata/$date';
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following:

   1. Create an empty `.substituted` file in the current working directory
-  2. Read parameters from files, command line and populate parameter hash using precedence rules describe above.
+  2. Create `parameter hash` that maps parameter names to parameter values.
+  3. Read parameters from files in the order they are specified on the command line
+  4. `Resolve each parameter`:
+    * search the parameter value for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the corresponding parameter is not found in the parameter hash.
+    * if the parameter value is enclosed in backticks, run the command and capture its stdout. If the command succeeds (returns 0), store the parameter in the hash with the value equal to stdout of the command. If the command fails (returns non-0 value), report the error and abort the processing.
+    * if the value is not a command, store it in the parameter hash.
+    * if this is a duplicate parameter, warn and replace the old value with the newly generated one.
+  5. Resolve each command line parameter in the order they are specified on the command line
+    * use the same resolution steps as for parameters passed in a file
-  3. For each line in the input script
+  6. For each line in the input script
     * if comment or empty line, copy over
+    * if declare line, resolve the parameter using the same steps as for parameters passed in a file
-    * if declare line
-      * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash.
-      * if the param value is enclosed in backticks, run the command and capture its stdout. If the command succeeds, store the parameter defined in `declare` in the parameter hash with its value set to command's stdout. If the command fails, report the error and abort the processing.
-      * if declare statement is not a command, store it in the parameter hash.
-    * default line is encountered, the parameter defined is looked up in the parameter hash. If the parameter is not found, processing identical to declare line is performed; otherwise, the line is skipped.
+    * if default line is encountered, the parameter defined is looked up in the parameter hash. If the parameter is not found, processing identical to declare line is performed; otherwise, the line is skipped.
     * for all other lines
       * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash. (Reuse the code from the parameter substitution in declare statement.)
       * place the substituted line into the output file.
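
The resolution steps in this revision (steps 2 through 5) can be sketched roughly as follows. This is only an illustration in Python, not the Pig implementation: the function names (`resolve`, `resolve_all`, `substitute`, `run_shell`) are hypothetical, parameters are assumed to arrive as already parsed (name, raw value) pairs, stripping whitespace from the command's stdout is an assumption the proposal does not state, and case-insensitive names and quoting are left out.

```python
import re
import subprocess
import warnings

# $name, where name starts with a letter or underscore (so $0 is not a parameter).
PARAM_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def run_shell(command):
    """Run a command and return its stdout; a non-zero exit status is an error."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    if done.returncode != 0:
        raise RuntimeError(f"command failed ({done.returncode}): {command}")
    # Assumption: surrounding whitespace in stdout is not part of the value.
    return done.stdout.strip()


def substitute(text, params):
    """Replace every $name in text; fail if a referenced parameter is unknown."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined parameter: {name}")
        return params[name]
    return PARAM_REF.sub(lookup, text)


def resolve(name, raw_value, params, run=run_shell):
    """Resolve one parameter and store it in the parameter hash (step 4)."""
    value = substitute(raw_value, params)
    # A value enclosed in backticks is a command whose stdout becomes the value.
    if len(value) >= 2 and value.startswith("`") and value.endswith("`"):
        value = run(value[1:-1])
    if name in params:
        warnings.warn(f"duplicate parameter {name}: replacing old value")
    params[name] = value


def resolve_all(file_params, cli_params, run=run_shell):
    """Resolve file parameters first, then command line ones, each in order."""
    params = {}
    for name, raw_value in list(file_params) + list(cli_params):
        resolve(name, raw_value, params, run)
    return params
```

Passing the command runner as an argument keeps the sketch testable without a shell; for example, `resolve_all([("date", "20080201")], [("date", "20080202")])` leaves `date` set to the command line value and emits one duplicate warning.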
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement.

  {{{
- #declare CMD `generate_date`
+ %declare CMD `generate_date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -40, +40 @@

  For this example, pig would execute `generate_date` command when it encounters the `declare` statement and assigns the result (stdout) to parameter `CMD`. The value of `CMD` is substituted prior to running the load statement.

- `declare` statement starts with `#` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself.
+ `declare` statement starts with `%` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself.

  `declare` can also be used to define one parameter in terms of others:

  {{{
- #declare param1 ($param2 + $param3)
+ %declare param1 ($param2 + $param3)
  }}}

  With exception to string literals that can span multiple lines, for initial release, `declare` is a single-line command.
@@ -53, +53 @@

  The command specified within `declare` statement can take parameters which need to be substituted as well.

  {{{
- #declare CMD `generate_date $date`
+ %declare CMD `generate_date $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -64, +64 @@

  Note that variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error:

  {{{
- #declare A `cmd1 $B`
+ %declare A `cmd1 $B`
- #declare $B `cmd2`
+ %declare $B `cmd2`
  }}}

  Command name itself can be a parameter.

  {{{
- #declare CMD `$mycmd $date`
+ %declare CMD `$mycmd $date`
  A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
@@ -108, +108 @@

  `declare` command takes the highest precedence. Having multiple `declare` commands defining the same parameter is an error that results in an error message and abort of the processing.

- Default parameter values can be specified in a script using `#default ` statement. This statement is identical to `declare` except that it has the lowest precedence meaning that its value is only used if it has not been defined before.
+ Default parameter values can be specified in a script using `%default ` statement. This statement is identical to `declare` except that it has the lowest precedence meaning that its value is only used if it has not been defined before.

  {{{
- #default cmd=generate_name
+ %default cmd=generate_name
  }}}
+
+ Values specified from the command line as well as configuration file can be commands or expressions including other parameters. Their format is identical to `declare` and `default` format. Also, the same rule that variables need to be resolved before they can be used applies. The following order will be used:
+
+  1. Configuration files will be scanned in the order they are specified on the command line. Within each file, the parameters are processed in the order they are specified.
+  2. Command line parameters will be scanned in the order they are specified on the command line.
+  3. declare/default commands will be processed in the order they appear in the pig script.

  === Debugging ===
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  .
  }}}

- In this example, parameters `cmd` and `date` are substituted first when `declare` statement is encountered. Then the resulting command is executed and its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `mycmd` and `date` are substituted first when `declare` statement is encountered. Then the resulting command is executed and its stdout is placed into the path prior to running the load statement.

  Note that parameter names are case insensitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
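
To illustrate the case-insensitivity rule, here is a small Python sketch of a parameter table that folds names to lower case before storing or looking them up. `ParamTable` and `expand` are hypothetical names chosen for this example, and folding with `str.lower` is an assumption about how the comparison might be done, not something the proposal specifies.

```python
import re

# $name, where name starts with a letter or underscore, so Pig's positional
# references such as $0 are left alone.
PARAM_REF = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


class ParamTable:
    """Parameter hash whose names are compared without regard to case."""

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        self._values[name.lower()] = value

    def get(self, name):
        # Raises KeyError when the parameter was never defined.
        return self._values[name.lower()]


def expand(text, table):
    """Replace each $name in text with its value from the table."""
    return PARAM_REF.sub(lambda m: table.get(m.group(1)), text)
```

With `mycmd` and `date` defined in the table, `expand("$MYCMD $Date", table)` and `expand("$mycmd $date", table)` give the same command string, while a line such as `B = filter A by $0>'5';` passes through unchanged.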
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution -- = Parameter Substitution in Pig = - == Motivation == + == Motivation == This document describes a proposal for implementing parameter substitution in pig. This proposal is motivated by multiple requests from users who would like to create a template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except the date it needs to process, it would be very convenient to put a placeholder for the date and provide the actual value at run time. @@ -29, +29 @@ For this example, pig would expect `date` to be passed from pig command line or from a parameter file. The value would be substituted prior to running the load statement. - In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement. + In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement. {{{ - declare CMD `generate_date` + #declare CMD `generate_date` A = load '/data/mydata/$CMD'; B = filter A by $0>'5'; . @@ -40, +40 @@ For this example, pig would execute `generate_date` command when it encounters the `declare` statement and assigns the result (stdout) to parameter `CMD`. The value of `CMD` is substituted prior to running the load statement. - A command can take parameters which need to be substituted as well. + `declare` statement starts with `#` to indicate that it is part of the preprocessor that performs parameter substitution rather than Pig language itself. 
+ + `declare` can also be used to define one parameter in terms of others: {{{ + #declare param1 ($param2 + $param3) + }}} + + With exception to string literals that can span multiple lines, for initial release, `declare` is a single-line command. + + The command specified within `declare` statement can take parameters which need to be substituted as well. + + {{{ - declare CMD `generate_date $date` + #declare CMD `generate_date $date` A = load '/data/mydata/$CMD'; B = filter A by $0>'5'; . @@ -54, +64 @@ Note that variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error: {{{ - declare A `cmd1 $B` + #declare A `cmd1 $B` - declare $B `cmd2` + #declare $B `cmd2` }}} Command name itself can be a parameter. {{{ - declare CMD `$mycmd $date` + #declare CMD `$mycmd $date` A = load '/data/mydata/$CMD'; B = filter A by $0>'5'; . @@ -96, +106 @@ Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters. - The fault parameter values can be specified in a script using `declare =` statement: + `declare` command takes the highest precedence. Having multiple `declare` commands defining the same parameter is an error that results in an error message and abort of the processing. + + Default parameter values can be specified in a script using `#default ` statement. This statement is identical to `declare` except that it has the lowest precedence meaning that its value is only used if it has not been defined before. {{{ - declare cmd=generate_name + #default cmd=generate_name }}} - - Default values are only used if parameters is not specified. - - `declare` can also be used to define one parameter in terms of others: - - {{{ - declare param1 ($param2 + $param3) - }}} - - Note that `param2` and `param3` must be defined prior to this `declare` statement. 
=== Debugging === @@ -122, +124 @@ A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following: - 1. Create an empty `.substituted` file in the current working directory + 1. Create an empty `.substituted` file in the current working directory 2. Read parameters from files, command line and populate parameter hash using precedence rules describe above. 3. For each line in the input script * if comment or empty line, copy over @@ -130, +132 @@ * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash. * if the param value is enclosed in backticks, run the command and capture its stdout. If the command succeeds, store the parameter defined in `declare` in the parameter hash with its value set to command's stdout. If the command fails, report the error and ab
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
The following page has been changed by OlgaN:
http://wiki.apache.org/pig/ParameterSubstitution

--
  pig -param date='20080201' -param cmd='generate_name'
  }}}

- Parameters can also be specified in a file that can be passed to pig using `-param_file ` construct. Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used. If a parameter present in multiple files, the value from the last file will be used and a warning will be generated.
+ Parameters can also be specified in a file that can be passed to pig using `-param_file ` construct. Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used and a warning will be generated. If a parameter is present in multiple files, the value from the last file will be used and a warning will be generated.

- A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comment must take the full line and `#` must be the first character on the line. Each parameter line would be of the form: `=. White spaces around `=` are allowed but are optional. Param value can include white spaces. There is no need to quote the value and the quotes will be considered to be part of the value.
+ A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. Comments must take a full line and `#` must be the first character on the line. Each parameter line will be of the form: `=`. White spaces around `=` are allowed but are optional. A parameter value can include white spaces. There is no need to quote the value and the quotes will be considered part of the value.

  The parameter file for Example 4 above would look as follows:

@@ -96, +96 @@

  Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.

- The fault parameter values can be specified in a script using `declare =` command:
+ The default parameter values can be specified in a script using `declare =` statement:

  {{{
  declare cmd=generate_name
@@ -120, +120 @@

  == Design ==

- A C-style preprocessor will be writtem to perform parameter substitution. The preprocessor will do the following:
+ A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following:

   1. Create an empty `.substituted` file in the current working directory
   2. Read parameters from files, command line and populate parameter hash using precedence rules describe above.
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution
--
  For this example, parameter `date` is substituted first when `declare` statement is encountered. Then `generate_name` command is executed passing value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` which is used in the load statement prior to its execution.
- Note that the variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error:
+ Note that variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error:
  {{{
  declare A `cmd1 $B`
@@ -67, +67 @@
  .
  }}}
- In this example, parameters `cmd` and `date` are substituted first when `declare` statement is encountered. The the resulting command is executed and its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `cmd` and `date` are substituted first when `declare` statement is encountered. Then the resulting command is executed and its stdout is placed into the path prior to running the load statement.
- Note that parameter names are case insesitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
+ Note that parameter names are case insensitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
  === Parameter Passing ===
@@ -94, +94 @@
  cmd = generate_name
  }}}
- Files and command line parameters can be combined with command line parameters taking precedence over files in case of duplicate parameters.
+ Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.
  The fault parameter values can be specified in a script using `declare =` command:
@@ -103, +103 @@
  }}}
  Default values are only used if parameters is not specified.
+
+ `declare` can also be used to define one parameter in terms of others:
+
+ {{{
+ declare param1 ($param2 + $param3)
+ }}}
+
+ Note that `param2` and `param3` must be defined prior to this `declare` statement.
  === Debugging ===
@@ -115, +123 @@
  A C-style preprocessor will be writtem to perform parameter substitution. The preprocessor will do the following:
  1. Create an empty `.substituted` file in the current working directory
- 2. Read parameters from files, command line , and declare statement and construct a hash preserving the precedence rules in case of duplicates described above
+ 2. Read parameters from files, command line and populate parameter hash using precedence rules describe above.
  3. For each line in the input script
-   * if declare line, skip
    * if comment or empty line, copy over
+   * if declare line
+     * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash.
+     * if the param value is enclosed in backticks, run the command and capture its stdout. If the command succeeds, store the parameter defined in `declare` in the parameter hash with its value set to command's stdout. If the command fails, report the error and abort the processing.
+     * if declare statement is not a command, store it in the parameter hash.
    * for all other lines
+     * search the line for variables that need to be replaced and perform replacement if needed. Generate an error and abort if replacement is needed but the correspondent parameter is not found in the parameter hash. (Reuse the code from the parameter substitution in declare statement.)
-     * parse each part enclosed in `%` and remove `%s`. Locate any identifier that starts with `$` and lookup it up in the hash. If found, make the substitution; otherwise, report an error and abort.
-     * if the part is enclosed in backticks, run the substitued command. If the command succeeds, substitute the command with its stdout. If it fails, report an error and abort.
      * place the substituted line into the output file.
  4. If -dryrun is not specified, pass the output file to grunt to execute. Otherwise, print the name of the file and exit.
  5. if neither -debug nor -dryrun are specified, remove the output file.
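The per-line processing described in step 3 of this revision could be sketched roughly as below. This is an illustrative Python sketch, not the actual Pig preprocessor: the names `preprocess`, `substitute`, `run_command` and `SubstitutionError` are hypothetical, `--` is assumed to be the comment marker, and it is assumed that a plain `declare` value only acts as a default while a backtick command result is always stored. The `.substituted` file, `-debug` and `-dryrun` handling are left out.

```python
import re
import subprocess

# Matches $name but not positional references such as $0.
PARAM_RE = re.compile(r"\$([A-Za-z_]\w*)")
# Accepts both "declare NAME value" and "declare NAME=value".
DECLARE_RE = re.compile(r"^\s*declare\s+(\w+)\s*=?\s*(.*?)\s*$", re.IGNORECASE)


class SubstitutionError(Exception):
    """Raised when a parameter is undefined or a command fails."""


def substitute(text, params):
    """Replace every $name in text; parameter names are case insensitive."""
    def lookup(match):
        name = match.group(1).lower()
        if name not in params:
            raise SubstitutionError(f"undefined parameter: ${match.group(1)}")
        return params[name]
    return PARAM_RE.sub(lookup, text)


def run_command(command):
    """Run a shell command and return its stdout, aborting on failure."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise SubstitutionError(f"command failed: {command}")
    return result.stdout.strip()


def preprocess(lines, params, run=run_command):
    """Return the substituted script lines; declare lines are consumed."""
    params = {name.lower(): value for name, value in params.items()}
    output = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("--"):
            output.append(line)  # comment or empty line: copy over
            continue
        match = DECLARE_RE.match(line)
        if match:
            name = match.group(1).lower()
            value = substitute(match.group(2), params)
            if len(value) >= 2 and value.startswith("`") and value.endswith("`"):
                params[name] = run(value[1:-1])  # command: store its stdout
            else:
                params.setdefault(name, value)  # assumption: default only
            continue
        output.append(substitute(line, params))
    return output
```

Because substitution inside a `declare` line happens as soon as the line is read, a sequence such as ``declare A `cmd1 $B` `` followed by a later definition of `B` fails with an undefined-parameter error, which matches the ordering rule stated on the page.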
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution
--
  == Requirements ==
  1. Ability to have parameters within a pig script and provide values for this parameters at run time.
- 2. Ability to provide parameter values on the command file
+ 2. Ability to provide parameter values on the command line
  3. Ability to provide parameter values in a file
- 4. Ability to generate parameter value at run time by running a binary or script.
+ 4. Ability to generate parameter values at run time by running a binary or a script.
  5. Ability to provide default values for parameters
  6. Ability to retain the script with all parameters resolved. This is mostly for debugging purposes.
@@ -19, +19 @@
  === Parameter Specification ===
- Parameters in pig script will be of the for `$`.
+ Parameters in a pig script will be of the form `$`.
  {{{
- A = load '/data/mydata/%$date%';
+ A = load '/data/mydata/$date';
  B = filter A by $0>'5';
  .
  }}}
- In this case, pig would expect the program to pass a value for parameter named `date` and will place it in the path before executing the load statement.
+ For this example, pig would expect `date` to be passed from pig command line or from a parameter file. The value would be substituted prior to running the load statement.
- In addition to parameters, a user can supply a command to execute to get the substitution value. This can be done using `declare` command.
+ In addition to supplying parameter value, a user can supply a command to execute to generate a parameter value. This can be done using `declare` statement.
  {{{
  declare CMD `generate_date`
@@ -38, +38 @@
  .
  }}}
- In this case, pig would execute the program called `generate_date` when it encounters `declare` statement and place its stdout into the path before executing the load statement.
+ For this example, pig would execute `generate_date` command when it encounters the `declare` statement and assigns the result (stdout) to parameter `CMD`. The value of `CMD` is substituted prior to running the load statement.
  A command can take parameters which need to be substituted as well.
@@ -49, +49 @@
  .
  }}}
- In this case, parameter `date` is substituted first, then `generate_name` command ran passing value of `date` as command line parameter and the stdout of the command is placed into the path prior to executing the load statement.
+ For this example, parameter `date` is substituted first when `declare` statement is encountered. Then `generate_name` command is executed passing value of `date` as a parameter to it. Its output (stdout) is assigned to `CMD` which is used in the load statement prior to its execution.
- Note that the variables passed on the command line must be resolved prior to the declare statement. This is wrong:
+ Note that the variables passed on the command line must be resolved prior to the declare statement. The following sequence would cause an error:
  {{{
  declare A `cmd1 $B`
@@ -67, +67 @@
  .
  }}}
- In this case, parameters `cmd` and `date` are substituted first. The the resulting command is executed and its stdout is placed into the path prior to running the load statement.
+ In this example, parameters `cmd` and `date` are substituted first when `declare` statement is encountered. The the resulting command is executed and its stdout is placed into the path prior to running the load statement.
  Note that parameter names are case insesitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution
--
  === Parameter Specification ===
+ Parameters in pig script will be of the for `$`.
- Parameters in the script will be enclosed in `%`. The content within `%` can be either a parameter name in which case it starts with `$` or a command to run to generate the value in which case it will be enclosed in backticks.
-
- '''Example 1''': parameter name specification
  {{{
  A = load '/data/mydata/%$date%';
@@ -31, +29 @@
  In this case, pig would expect the program to pass a value for parameter named `date` and will place it in the path before executing the load statement.
- '''Example 2''': command specification
+ In addition to parameters, a user can supply a command to execute to get the substitution value. This can be done using `declare` command.
  {{{
- A = load '/data/mydata/%`generate_date`%';
+ declare CMD `generate_date`
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
- In this case, pig would execute the program called `generate_date` and place its stdout into the path before executing the load statement.
+ In this case, pig would execute the program called `generate_date` when it encounters `declare` statement and place its stdout into the path before executing the load statement.
- '''Example 3''': command taking a parameter
+ A command can take parameters which need to be substituted as well.
  {{{
- A = load '/data/mydata/%`generate_name $date`%';
+ declare CMD `generate_date $date`
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
  In this case, parameter `date` is substituted first, then `generate_name` command ran passing value of `date` as command line parameter and the stdout of the command is placed into the path prior to executing the load statement.
- '''Example 4''': command is passed as a parameter:
+ Note that the variables passed on the command line must be resolved prior to the declare statement. This is wrong:
  {{{
+ declare A `cmd1 $B`
+ declare $B `cmd2`
+ }}}
+
+ Command name itself can be a parameter.
+
+ {{{
+ declare CMD `$mycmd $date`
- A = load '/data/mydata/%`$cmd $date`%';
+ A = load '/data/mydata/$CMD';
  B = filter A by $0>'5';
  .
  }}}
  In this case, parameters `cmd` and `date` are substituted first. The the resulting command is executed and its stdout is placed into the path prior to running the load statement.
+
+ Note that parameter names are case insesitive and $cmd and $CMD means the same thing. This is to match the rest of Pig Latin.
  === Parameter Passing ===
[Pig Wiki] Update of "ParameterSubstitution" by OlgaN
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification. The following page has been changed by OlgaN: http://wiki.apache.org/pig/ParameterSubstitution

New page:

= Parameter Substitution in Pig =

== Motivation ==

This document describes a proposal for implementing parameter substitution in pig. The proposal is motivated by multiple requests from users who would like to create a template pig script and then use it with different parameters on a regular basis. For instance, if you have daily processing that is identical every day except for the data it needs to process, it would be very convenient to put in a placeholder for the date and provide the actual value at run time.

== Requirements ==

 1. Ability to have parameters within a pig script and provide values for these parameters at run time.
 2. Ability to provide parameter values on the command line
 3. Ability to provide parameter values in a file
 4. Ability to generate a parameter value at run time by running a binary or script.
 5. Ability to provide default values for parameters
 6. Ability to retain the script with all parameters resolved. This is mostly for debugging purposes.

== Interface ==

=== Parameter Specification ===

Parameters in the script will be enclosed in `%`. The content within `%` can be either a parameter name, in which case it starts with `$`, or a command to run to generate the value, in which case it will be enclosed in backticks.

'''Example 1''': parameter name specification

{{{
A = load '/data/mydata/%$date%';
B = filter A by $0>'5';
.
}}}

In this case, pig would expect the program to pass a value for the parameter named `date` and will place it in the path before executing the load statement.

'''Example 2''': command specification

{{{
A = load '/data/mydata/%`generate_date`%';
B = filter A by $0>'5';
.
}}}

In this case, pig would execute the program called `generate_date` and place its stdout into the path before executing the load statement.

'''Example 3''': command taking a parameter

{{{
A = load '/data/mydata/%`generate_name $date`%';
B = filter A by $0>'5';
.
}}}

In this case, parameter `date` is substituted first, then the `generate_name` command is run with the value of `date` as its command line parameter, and the stdout of the command is placed into the path prior to executing the load statement.

'''Example 4''': command is passed as a parameter:

{{{
A = load '/data/mydata/%`$cmd $date`%';
B = filter A by $0>'5';
.
}}}

In this case, parameters `cmd` and `date` are substituted first. Then the resulting command is executed and its stdout is placed into the path prior to running the load statement.

=== Parameter Passing ===

Parameters can be specified on the pig command line using the `-param =` construct. Multiple parameters can be specified. If the same parameter is specified multiple times, the last value will be used and a warning will be generated.

The command line for Example 4 above would look as follows:

{{{
pig -param date='20080201' -param cmd='generate_name'
}}}

Parameters can also be specified in a file that can be passed to pig using the `-param_file ` construct. Multiple files can be specified. If the same parameter is present multiple times in the file, the last value will be used. If a parameter is present in multiple files, the value from the last file will be used and a warning will be generated.

A parameter file will contain one line per parameter. Empty lines are allowed. Perl style (#) comment lines are also allowed. A comment must take the full line and `#` must be the first character on the line. Each parameter line would be of the form: `=`. White spaces around `=` are allowed but are optional. A parameter value can include white spaces. There is no need to quote the value; any quotes will be considered to be part of the value.
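
The parameter file format and the precedence rules in this revision of the page can be illustrated with a small Python sketch. It is only a reading of the rules as written here (quotes stay part of the value, `#` must be in the first column, the last value wins, the command line overrides files); the function names `parse_param_lines` and `merge_params` are hypothetical and are not part of Pig.

```python
import warnings


def parse_param_lines(lines):
    """Parse 'name = value' lines into a dict; the last duplicate wins."""
    params = {}
    for raw in lines:
        line = raw.rstrip("\n")
        # Empty lines and full-line comments ('#' in the first column) are skipped.
        if not line.strip() or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if not sep or not name.strip():
            raise ValueError(f"malformed parameter line: {line!r}")
        # White space around '=' is optional; quotes remain part of the value.
        params[name.strip()] = value.strip()
    return params


def merge_params(file_param_sets, cli_params):
    """Combine parameters from files (in order) with command line parameters."""
    merged = {}
    for params in file_param_sets:
        for name, value in params.items():
            if name in merged:
                warnings.warn(f"parameter {name!r} redefined by a later file")
            merged[name] = value
    merged.update(cli_params)  # command line takes precedence over files
    return merged
```

For example, `merge_params([parse_param_lines(open_file)], {"date": "20080202"})` would keep `cmd` from the file but use the command line value for `date`, where `open_file` stands for any iterable of lines.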

The parameter file for Example 4 above would look as follows:

{{{
# my parameters
date = 20080201
cmd = generate_name
}}}

Files and command line parameters can be combined, with command line parameters taking precedence over files in case of duplicate parameters.

Default parameter values can be specified in a script using the `declare =` command:

{{{
declare cmd=generate_name
}}}

Default values are only used if the parameter is not specified.

=== Debugging ===

If the -debug option is specified, pig will produce a fully substituted pig script in the current working directory named `.substituted`

A -dryrun option will be added to pig, in which case no execution is performed and only the substituted script is produced. We can also use the same option to produce just the execution plan.

== Design ==

A C-style preprocessor will be written to perform parameter substitution. The preprocessor will do the following:

 1. Create an empty `.substituted` file in the current working directory
 2. Read parameters from files, command line, and declare statement and construct a hash preserving the precedence rules in