Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  
     * In the second case we add an implicit split to transform the query to case number one. That eliminates processing A' multiple times (see the sketch after this list).
     * Make the split non-blocking and allow processing to continue. This helps 
to reduce the amount of data that has to be stored right at the split.
-    * Allow multiple outputs from a job. This way we can store some results as 
a side-effect. This is also necessary to make the previous item work.
+    * Allow multiple outputs from a job. This way we can store some results as 
a side-effect of the main job. This is also necessary to make the previous item 
work.
     * Allow multiple split branches to be carried on to the combiner/reducer. 
This reduces the amount of IO again in the case where multiple branches in the 
split can benefit from a combiner run.
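 
 For illustration, a minimal script of the kind these items address (file and alias names here are hypothetical):
 
 {{{
 A = LOAD 'input' AS (x, y);
 B = FILTER A BY x > 5;
 STORE A INTO 'out1';
 STORE B INTO 'out2';
 }}}
 
 Since A feeds both stores, an implicit split is inserted and 'input' is read and processed only once.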
  
  [[Anchor(Storing_intermediate_results)]]
@@ -72, +72 @@

  
     * Implicit splits: It's probably what you expect when you use the same 
handle in different stores.
     * Store/Load vs Split: When optimizing, it's a reasonable assumption that splits are faster than load/store combinations (see the sketch after this list).
-    * Side-effects: There is no way right now to make use of this
+    * Side-files: Side-files (multiple outputs from a single map-reduce job) are available in Hadoop, but cannot be made use of in Pig in the current system.
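 
 As a sketch of the store/load case (paths and aliases hypothetical), a sequence like
 
 {{{
 STORE A INTO 'intermediate';
 B = LOAD 'intermediate';
 C = GROUP B BY $0;
 }}}
 
 can, under this assumption, be rewritten so that B is connected through a split to the plan that produces A, rather than through the file 'intermediate'.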
  
  [[Anchor(Changes)]]
  === Changes ===
@@ -112, +112 @@

  run [-param <key>=<value>]* [-param_file <filename>] <script name>
  }}}
  
- Which runs the script in interactive mode, so every store triggers execution. 
The statements from the script are put into the command history and all the 
handles 
+ Which runs the script in interactive mode, so every store triggers execution. 
The statements from the script are put into the command history 
+ and all the handles defined in the script can be referenced in subsequent 
statements after the run command has completed. Issuing a run command
+ on the grunt command line has basically the same effect as typing the 
statements manually.
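 
 A hypothetical grunt session (script name and aliases are invented for illustration):
 
 {{{
 grunt> run myscript.pig
 grunt> D = FILTER A BY $0 > 0;
 grunt> DUMP D;
 }}}
 
 Assuming myscript.pig defines the handle A, the filter and dump work exactly as if the script's statements had been typed at the prompt.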
  
  [[Anchor(Exec)]]
  ==== Exec ====
@@ -123, +125 @@

  exec [-param <key>=<value>]* [-param_file <filename>]* [<script name>]
  }}}
  
+ Which will run the script in batch mode. Store statements will not trigger execution; rather, the entire script will be parsed before execution starts. Unlike the "run" command, exec does not change the command history or remember the handles used inside the script. Exec without any
- Which will run the script in batch mode. Exec without any parameters can be 
used in scripts to force execution up to the point in the script where the exec 
occurs.
+ parameters can be used in scripts to force execution up to the point in the 
script where the exec occurs.
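 
 A sketch of exec forcing execution mid-script (file names hypothetical):
 
 {{{
 A = LOAD 'data';
 STORE A INTO 'out1';
 -- everything above is executed at this point
 exec
 B = LOAD 'out1';
 STORE B INTO 'out2';
 }}}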
  
  [[Anchor(Explain)]]
  ==== Explain ====
@@ -218, +222 @@

  [[Anchor(HBaseStorage)]]
  ==== HBaseStorage ====
  Scripts using the HBaseStorage loader will trigger a warning with the 
multi-query optimization turned on. The reason
- is the same as described above. 
+ is the same as described above. Table names (since they are given without a 
scheme) will be interpreted as relative HDFS paths, and the HBaseStorage function will see an expanded path of the
form "hdfs://<host>:<port>/.../<tablename>".
+ The storage function will in this case take whatever is after the last "/" in 
the string and try to use it as the name
+ of the requested table. The warning will notify the user of this situation.
  
  Scripts like:
  
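 (a sketch; the table name and column spec here are hypothetical)
 
 {{{
 A = LOAD 'users' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name');
 STORE A INTO 'output';
 }}}
 
 will trigger the warning, because 'users' is first expanded to an absolute hdfs:// path and HBaseStorage recovers the table name only from the last path segment.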
@@ -508, +515 @@

  
  If the multi-query optimization is turned off, all graphs will be generated as interactive, which is how we revert the behavior.
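 
 For reference, the optimization can be switched off on the command line (assuming the -no_multiquery flag of the released implementation; -M is the short form):
 
 {{{
 $ pig -no_multiquery myscript.pig
 }}}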
  
- The merging of the different logical plans is done in the OperatorPlan, where 
merges can now either check for disjoint graphs or merge them with overlaps. 
Merged plans are later passed on to the implicit split inserter at which point 
all the overlapping operators from the merge will result in an explicit split.
+ The merging of the different logical plans is done in the OperatorPlan, where 
merge operations can now either check for disjoint graphs or merge them with 
overlaps. Merged plans are later passed on to the implicit split inserter at 
which point all the overlapping operators from the merge will result in an 
explicit split.
  
  Finally, the store-load handling is done in the Pig server. It will either 
transform the plan or add a store-load connection. Absolute filenames will be 
available, since the QueryParser now translates them when it sees them.
  
