Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by GuntherHagleitner: http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification ------------------------------------------------------------------------------ Both of the above are equivalent, but the performance will be different. - Here's what we plan to do to increase the performance: + Here's what the multi-query feature does to increase performance: - * In the second case we will add an implicit split to transform the query to case number one. That will eliminate the processing of A' multiple times. + * In the second case we add an implicit split to transform the query to case number one. That eliminates the processing of A' multiple times. - * Make the split non-blocking and allow processing to continue. This will help reduce the amount of data that has to be stored right at the split. + * Make the split non-blocking and allow processing to continue. This helps reduce the amount of data that has to be stored right at the split. * Allow multiple outputs from a job. This way we can store some results as a side-effect. This is also necessary to make the previous item work. - * Allow multiple split branches to be carried on to the combiner/reducer. This will reduce the amount of IO again in the case where multiple branches in the split can benefit from a combiner run. + * Allow multiple split branches to be carried on to the combiner/reducer. This again reduces the amount of IO in the case where multiple branches in the split can benefit from a combiner run. [[Anchor(Storing_intermediate_results)]] ==== Storing intermediate results ==== @@ -80, +80 @@ [[Anchor(Execution_in_batch_mode)]] ==== Execution in batch mode ==== - Batch mode is entered when Pig is given a script to execute. Interactive mode is on the grunt shell ("grunt:>"). Right now there isn't much difference between them. In order for us to optimize the multi-query case, we'll need to distinguish the two more. 
+ Batch mode is entered when Pig is given a script to execute (e.g.: invoking Pig with a script as a parameter, or using the "-e" or "-f" Pig options). Interactive mode is the grunt shell ("grunt:>"). Previously there wasn't much difference between them; in order to optimize the multi-query case, we needed to distinguish the two more. - Right now whenever the parser sees a store (or dump, illustrate) it will kick of the execution of that part of the script. Part of this proposal is that in batch mode, we parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start. + Right now whenever the parser sees a store (or dump, illustrate) it will kick off the execution of that part of the script. Part of this change was to make batch mode parse the entire script first and see if we can combine things to reduce the overall amount of work that needs to be done. Only after that will the execution start. - The following changes are proposed (in batch): + The high-level changes are: - * Store will not trigger an immediate execution. The entire script is considered before the execution starts. + * Store does not trigger an immediate execution. The entire script is considered before the execution starts. - * Explicit splits will be put in places where a handle has multiple children. If the user wants to explicitly force re-computation of common ancestors she has to provide multiple scripts. - * Multiple split branches/stores in the script will be combined into the same job, if possible. Again, using multiple scripts is the way to go to avoid this (if that is desired). + * Explicit splits are put in places where a handle has multiple children. + * Multiple split branches/stores in the script are combined into the same job, if possible. Some problems with this: * Explain works on handles, which only gives you a slice of the entire script execution at a time. 
What's more, at the point where they occur in a script they might not give you an accurate picture of the situation, since the execution plans might change once the entire script is handled. * Debugging on the grunt shell is more complicated, since scripts run differently than what one might type on the shell. - The proposal therefore is: + Additional changes therefore are: * Add run/exec commands to the shell to execute a script in interactive or batch mode for debugging. * Add scripts as a target (in addition to handles) to explain. @@ -112, +112 @@ run [-param <key>=<value>]* [-param_file <filename>] <script name> }}} - Which will run the script in interactive mode. + Which runs the script in interactive mode, so every store triggers execution. The statements from the script are put into the command history and all the handles [[Anchor(Exec)]] ==== Exec ==== @@ -153, +153 @@ Brief: - * Does not expand nested plans + * Does not expand nested plans (presenting a smaller graph for overview) Param/Param_file: * Allows for param substitution in scripts. + + [[Anchor(Turning_off_multi_query)]] + === Turning off multi query === + + By default the multi-query optimization is enabled and script execution is handled accordingly. If it is desired + to turn off the optimization and revert to "execute-on-store" behavior, the "-M" or "-no_multiquery" switches can be used. + + In order to run the script "foo.pig" without the optimization, execute pig as follows: + + {{{ + $ pig -M foo.pig + or: + $ pig -no_multiquery foo.pig + }}} + + If pig is launched in interactive mode with this switch, "exec" statements will also run in interactive mode. + + [[Anchor(Incompatible_Changes)]] + === Incompatible Changes === + Most existing scripts produce the same result with or without the multi-query optimization. There are cases, though, + where this is not true. 
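+ + As an illustration of the combining behavior described earlier, consider a script in which one load feeds two stores (the relation and file names here are made up for illustration): + + {{{ + A = load 'input'; + B = filter A by $0 > 0; + C = filter A by $0 <= 0; + store B into 'pos'; + store C into 'neg'; + }}} + + With the optimization on, Pig inserts an implicit split on A and combines both branches, and both stores, into a single map-reduce job, so 'input' is read only once instead of once per store.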
+ + [[Anchor(Path_Names_and_Schemes)]] + ==== Path Names and Schemes ==== + Any script is parsed in its entirety before it is sent to execution. Since the current directory can change + throughout the script, any path used in a load or store is translated to a fully qualified and absolute path. + + In map-reduce mode, the following script: + + {{{ + cd /; + A = load 'foo'; + cd tmp; + store A into 'bar'; + }}} + + will load from "hdfs://<host>:<port>/foo" and store into "hdfs://<host>:<port>/tmp/bar". + + These expanded paths are passed to any LoadFunc or Slicer implementation. In some cases, especially when + a LoadFunc/Slicer is not used to read from a dfs file or path (e.g.: loading from an SQL database), this can cause + problems. + + Solutions are to either: + * Specify "-M" or "-no_multiquery" to revert to the old names + * Specify a custom scheme for the LoadFunc/Slicer + + Arguments used in a load statement that have a scheme other than "hdfs" or "file" will not be expanded and will be passed + to the LoadFunc/Slicer unchanged. + + In the SQL case: + + {{{ + A = load 'sql://footable' using SQLLoader(); + }}} + + will invoke the SQLLoader function with "sql://footable". + + [[Anchor(HBaseStorage)]] + ==== HBaseStorage ==== + Scripts using the HBaseStorage loader will trigger a warning with the multi-query optimization turned on. The reason + is the same as described above. + + Scripts like: + + {{{ + A = load 'table' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a','b'); + }}} + + should be changed to the following to avoid the warning: + + {{{ + A = load 'hbase://table' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('a','b'); + }}} + + Using "-M" or "-no_multiquery" will also remove the warning. + + [[Anchor(Implicit_Dependencies)]] + ==== Implicit Dependencies ==== + If a script has dependencies on the execution order outside of what Pig knows about, execution might fail. + + For instance: + + {{{ + ... 
+ store A into 'foo'; + B = load 'bar'; + C = foreach B generate MYUDF($0,'foo'); + store C into 'baz'; + }}} + + MYUDF might try to read from the file foo, the file that handle A was just stored into. However, Pig does not know + that MYUDF depends on the file foo and might submit the jobs producing the files baz and foo at the same time, so + execution might fail. + + In order to make this work, the script has to be changed to: + + {{{ + ... + store A into 'foo'; + exec; + B = load 'bar'; + C = foreach B generate MYUDF($0,'foo'); + store C into 'baz'; + }}} + + The exec statement will trigger the execution of the job resulting in the file foo. This way the right execution order + is enforced. [[Anchor(Phases)]] == Phases ==