Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification
------------------------------------------------------------------------------
  [[Anchor(Explicit/implicit_split:)]]
  ==== Explicit/implicit split: ====
- There might be cases in which you want different processing on separate parts of the same datastream. Like so:
+ There might be cases in which you want different processing on separate parts of the same data stream. Like so:
  {{{
  A = load ...
@@ -82, +82 @@
  Batch mode is entered when Pig is given a script to execute. Interactive mode is on the grunt shell ("grunt:>"). Right now there isn't much difference between them. In order for us to optimize the multi-query case, we'll need to distinguish the two more.
- Right now whenever the parser sees a store (or dump, explain, illustrate or describe) it will kick of the execution of that part of the script. Part of this proposal is that in batch mode, we parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start.
+ Right now whenever the parser sees a store (or dump, illustrate) it will kick of the execution of that part of the script. Part of this proposal is that in batch mode, we parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start.
  The following changes are proposed (in batch):
@@ -90, +90 @@
  * Explicit splits will be put in places where a handle has multiple children. If the user wants to explicitly force re-computation of common ancestors she has to provide multiple scripts.
  * Multiple split branches/stores in the script will be combined into the same job, if possible. Again, using multiple scripts is the way to go to avoid this (if that is desired).
- For diagnostic operators there are some problems with this:
+ Some problems with this:
- * They work on handles, which only gives you a slice of the entire script execution at a time. What's more, is that at the point they may occur in a script they might not give you an accurate picture about the situation, since the execution plans might change once the entire script is handled.
+ * Explain works on handles, which only gives you a slice of the entire script execution at a time. What's more, is that at the point they may occur in a script they might not give you an accurate picture about the situation, since the execution plans might change once the entire script is handled.
- * They change the logical tree. This means that we need to clone the tree before we run them - something that we want to avoid in batch execution.
+ * Debugging on the grunt shell is more complicated, since scripts run differently that what one might type on the shell.
  The proposal therefore is:
- * Have Pig in batch mode ignore explain, dump, illustrate and describe.
- * Add a load command to the shell to execute a script in interactive mode.
+ * Add a run/exec commands to the shell to execute a script in interactive or batch mode for debugging.
- * Add scripts as a target (in additions to handles) to some diagnostic parameters.
+ * Add scripts as a target (in additions to handles) to explain.
  * Add dot as an output type to explain (a graphical explanation of the graph will make multi-query explains more understandable.)
- That means that while someone is developing a PIG script they can put any diagnostic operator into the script and then go to the grunt shell and load the script. The statement will be executed and give you some information about that part of the script. When a script is loaded, the user will also be able to refer to any handles defined in the script on the shell.
-
- Finally, when the script is ready the user can run the same script in batch and all the diagnostic operators are ignored.
-
- [[Anchor(Load)]]
+ [[Anchor(Run)]]
- ==== Load ====
+ ==== Run ====
  (See https://issues.apache.org/jira/browse/PIG-574 - this is basically the same as requested there)
  The new command has the format:
  {{{
- load <script name>
+ run [-param <key>=<value>]* [-param_file <filename>] <script name>
  }}}
  Which will run the script in interactive mode.
+
+ [[Anchor(Exec)]]
+ ==== Exec ====
+
+ The new command has the format:
+
+ {{{
+ exec [-param <key>=<value>]* [-param_file <filename>]* [<script name>]
+ }}}
+
+ Which will run the script in batch mode. Exec without any parameters can be used in scripts to force execution up to the point in the script where the exec occurs.
  [[Anchor(Explain)]]
  ==== Explain ====
@@ -125, +131 @@
  Changes to the command:
  {{{
- explain <script>||<handle> [using text||dot] [into <path>]
+ explain [-out <path>] [-brief] [-dot] [-param <key>=<value>]* [-param_file <filename>]* [-script <scriptname>] [<handle>]
  }}}
  Behavior:
- * Explain is not executed in batch mode.
- * If explain is given a script, it will output the entire execution graph (logical, physical, MR + moving result files)
+ * If explain is given a script without a handle, it will output the entire execution graph (logical, physical, MR)
+ * If explain is given a script with a handle, it will output the plan for the handle given
+ * If no script is given, explain works as before
- Text/Dot:
+ Dot:
  * Text will give what we have today, dot will output a format that can be passed to dot for graphical display.
  * In Text mode, multiple output (split) will be broken out in sections.
- * Default (no using clause): Text
+ * Default (-dot): Text
- Path:
+ Out:
- * Will generate logical.[txt||dot], physical.[txt||dot], mapred.[txt||dot] in the specified directory.
+ * Will generate logical_plan.[txt||dot], physical_plan.[text||dot], exec_plan.[text||dot] in the specified directory.
  * Default (no path given): Stdout
+ Brief:
- [[Anchor(Illustrate)]]
- ==== Illustrate ====
- Changes to the command:
+ * Does not expand nested plans
+ Param/Param_file:
- {{{
- illustrate <script>||<handle> [into <file>]
- }}}
+ * Allows for param substitution in scripts.
- Behavior:
-
- * Illustrate is not executed in batch mode.
- * If illustrate is given a script, it will output the entire execution graph (logical, physical, MR + moving result files)
-
- File:
-
- * Will write the illustrate output into the specified file.
- * Default: Stdout
  [[Anchor(Phases)]]
  == Phases ==
@@ -203, +199 @@
  In order to do this we will change the batch mode of the parser to:
- * Not execute the plan when we see a store (or dump, illustrate, describe, explain - which will be ignored)
+ * Not execute the plan when we see a store or dump
  * Alter the already existing merge functionality to allow intersecting graphs to be joined into a single logical plan.
  * Wait until the entire script is parsed and merged before sending the plan on to do validation, optimization, etc.
- The new "load" command will simply feed all the lines of a script through the interactive mode.
+ The new "run" command will simply feed all the lines of a script through the interactive mode.
- [[Anchor(Explain,_Dump,_and_Illustrate_(Phase_1))]]
- ==== Explain, Dump, Describe and Illustrate (Phase 1) ====
+ [[Anchor(Explain_(Phase_1))]]
+ ==== Explain (Phase 1) ====
  As described above the changes are:
- * Ignore these operations in batch mode
  * Add options to explain and illustrate to work on a script file as well as a handle.
  * Add the ability to print plans as dot files and to write explain and illustrate output to files.
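
For illustration only, the revised run, exec and explain forms in this change might be used on the grunt shell roughly as follows. This is a sketch built from the proposed syntax above, not confirmed behavior: the script name myscript.pig, the parameter input, its value, and the output directory plans are all hypothetical, and A stands for a handle defined inside the script.

{{{
-- hypothetical names throughout: myscript.pig, input, /data/in, plans
-- run the script line by line in interactive mode
run -param input=/data/in myscript.pig

-- run the same script in batch mode, so the whole script is parsed and merged first
exec -param input=/data/in myscript.pig

-- explain the entire script (no handle given), writing dot output into a directory
explain -out plans -dot -script myscript.pig

-- explain only the plan for handle A of the script, without expanding nested plans
explain -brief -script myscript.pig A
}}}

Keeping run and exec side by side lets the same script be checked interactively and then re-run in batch mode, which is the debugging use the proposal names for these commands.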