[Pig Wiki] Update of "PigMultiQueryPerformanceSpecification" by GuntherHagleitner

Apache Wiki Mon, 23 Mar 2009 12:13:11 -0700

The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  [[Anchor(Explicit/implicit_split:)]]
  ==== Explicit/implicit split: ====
  
- There might be cases in which you want different processing on separate parts 
of the same datastream. Like so:
+ There might be cases in which you want different processing on separate parts 
of the same data stream. Like so:
  
  {{{
  A = load ...
@@ -82, +82 @@

  
  Batch mode is entered when Pig is given a script to execute. Interactive mode 
is on the grunt shell ("grunt:>"). Right now there isn't much difference 
between them. In order for us to optimize the multi-query case, we'll need to 
distinguish the two more. 
  
- Right now whenever the parser sees a store (or dump, explain, illustrate or 
describe) it will kick of the execution of that part of the script. Part of 
this proposal is that in batch mode, we parse the entire script first and see 
if we can combine things to reduce the overall amount of work that needs to be 
done. Only after that will the execution start. 
+ Right now, whenever the parser sees a store (or dump or illustrate) it will 
kick off the execution of that part of the script. Part of this proposal is 
that in batch mode, we parse the entire script first and see if we can combine 
things to reduce the overall amount of work that needs to be done. Only after 
that will the execution start. 
  
  The following changes are proposed (in batch):
  
@@ -90, +90 @@

     * Explicit splits will be put in places where a handle has multiple 
children. If the user wants to explicitly force re-computation of common 
ancestors she has to provide multiple scripts.
     * Multiple split branches/stores in the script will be combined into the 
same job, if possible. Again, using multiple scripts is the way to go to avoid 
this (if that is desired).
  
- For diagnostic operators there are some problems with this:
+ Some problems with this:
  
-    * They work on handles, which only gives you a slice of the entire script 
execution at a time. What's more, is that at the point they may occur in a 
script they might not give you an accurate picture about the situation, since 
the execution plans might change once the entire script is handled.
+    * Explain works on handles, which only gives you a slice of the entire 
script execution at a time. What's more, at the point where it occurs in a 
script it might not give you an accurate picture of the situation, since the 
execution plans might change once the entire script is handled.
-    * They change the logical tree. This means that we need to clone the tree 
before we run them - something that we want to avoid in batch execution.
+    * Debugging on the grunt shell is more complicated, since scripts run 
differently than what one might type on the shell.
  
  The proposal therefore is:
  
-    * Have Pig in batch mode ignore explain, dump, illustrate and describe.
-    * Add a load command to the shell to execute a script in interactive mode.
+    * Add run/exec commands to the shell to execute a script in interactive 
or batch mode for debugging.
-    * Add scripts as a target (in additions to handles) to some diagnostic 
parameters.
+    * Add scripts as a target (in addition to handles) to explain.
     * Add dot as an output type to explain (a graphical explanation of the 
graph will make multi-query explains more understandable.)
  
- That means that while someone is developing a PIG script they can put any 
diagnostic operator into the script and then go to the grunt shell and load the 
script. The statement will be executed and give you some information about that 
part of the script. When a script is loaded, the user will also be able to 
refer to any handles defined in the script on the shell. 
- 
- Finally, when the script is ready the user can run the same script in batch 
and all the diagnostic operators are ignored.
- 
- [[Anchor(Load)]]
+ [[Anchor(Run)]]
- ==== Load ====
+ ==== Run ====
  
  (See https://issues.apache.org/jira/browse/PIG-574 - this is basically the 
same as requested there)
  
  The new command has the format:
  
  {{{
- load <script name>
+ run [-param <key>=<value>]* [-param_file <filename>] <script name>
  }}}
  
  Which will run the script in interactive mode.
+ 
+ [[Anchor(Exec)]]
+ ==== Exec ====
+ 
+ The new command has the format:
+ 
+ {{{
+ exec [-param <key>=<value>]* [-param_file <filename>]* [<script name>]
+ }}}
+ 
+ Which will run the script in batch mode. Exec without any parameters can be 
used in scripts to force execution up to the point in the script where the exec 
occurs.
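
As a rough sketch of how the two proposed commands might be used, assuming a hypothetical script named myscript.pig that expects a parameter called date (both names are illustrative and not part of the proposal), the grunt shell invocations could look like this:

{{{
-- feed the script line by line through interactive mode
run -param date=20090323 myscript.pig

-- parse the whole script first, then execute it in batch mode
exec -param date=20090323 myscript.pig
}}}

Inside a script, a bare exec would act as a barrier under the proposed semantics. The paths 'input' and 'intermediate' below are made up, and whether the bare command needs a trailing semicolon is not stated in the proposal:

{{{
A = load 'input';
store A into 'intermediate';
-- force everything above this point to execute before continuing
exec
B = load 'intermediate';
}}}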
  
  [[Anchor(Explain)]]
  ==== Explain ====
@@ -125, +131 @@

  Changes to the command:
  
  {{{
- explain <script>||<handle> [using text||dot] [into <path>]
+ explain [-out <path>] [-brief] [-dot] [-param <key>=<value>]* [-param_file 
<filename>]* [-script <scriptname>] [<handle>]
  }}}
  
  Behavior:
  
-    * Explain is not executed in batch mode.
-    * If explain is given a script, it will output the entire execution graph 
(logical, physical, MR + moving result files)
+    * If explain is given a script without a handle, it will output the entire 
execution graph (logical, physical, MR)
+    * If explain is given a script with a handle, it will output the plan for 
the handle given
+    * If no script is given, explain works as before
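
The three cases above might look as follows on the grunt shell. This is only a sketch based on the proposed syntax; myscript.pig, the handle B, and the directory plans are hypothetical names:

{{{
-- script without a handle: entire execution graph (logical, physical, MR)
explain -script myscript.pig

-- script with a handle: only the plan for B as defined in the script
explain -script myscript.pig B

-- no script: works as before on a handle defined on the shell
explain B

-- options combined: dot output written to a directory instead of stdout
explain -out plans -dot -script myscript.pig
}}}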
  
- Text/Dot:
+ Dot:
  
     * Text will give what we have today, dot will output a format that can be 
passed to dot for graphical display.
     * In Text mode, multiple output (split) will be broken out in sections.
-    * Default (no using clause): Text
+    * Default (no -dot): Text
  
- Path:
+ Out:
  
-    * Will generate logical.[txt||dot], physical.[txt||dot], mapred.[txt||dot] 
in the specified directory.
+    * Will generate logical_plan.[txt||dot], physical_plan.[text||dot], 
exec_plan.[text||dot] in the specified directory.
     * Default (no path given): Stdout
  
+ Brief:
- [[Anchor(Illustrate)]]
- ==== Illustrate ====
  
- Changes to the command:
+    * Does not expand nested plans
  
+ Param/Param_file:
- {{{
- illustrate <script>||<handle> [into <file>]
- }}}
  
+    * Allows for param substitution in scripts.
- Behavior:
- 
-    * Illustrate is not executed in batch mode.
-    * If illustrate is given a script, it will output the entire execution 
graph (logical, physical, MR + moving result files)
- 
- File:
- 
-    * Will write the illustrate output into the specified file.
-    * Default: Stdout
  
  [[Anchor(Phases)]]
  == Phases ==
@@ -203, +199 @@

  
  In order to do this we will change the batch mode of the parser to:
  
-    * Not execute the plan when we see a store (or dump, illustrate, describe, 
explain - which will be ignored)
+    * Not execute the plan when we see a store or dump
     * Alter the already existing merge functionality to allow intersecting 
graphs to be joined into a single logical plan.
     * Wait until the entire script is parsed and merged before sending the 
plan on to do validation, optimization, etc.
  
- The new "load" command will simply feed all the lines of a script through the 
interactive mode.
+ The new "run" command will simply feed all the lines of a script through the 
interactive mode.
  
- [[Anchor(Explain,_Dump,_and_Illustrate_(Phase_1))]]
- ==== Explain, Dump, Describe and Illustrate (Phase 1) ====
+ [[Anchor(Explain_(Phase_1))]]
+ ==== Explain (Phase 1) ====
  
  As described above the changes are:
  
-    * Ignore these operations in batch mode
     * Add options to explain and illustrate to work on a script file as well 
as a handle.
     * Add the ability to print plans as dot files and to write explain and 
illustrate output to files.
