The following page has been changed by GuntherHagleitner:
http://wiki.apache.org/pig/PigMultiQueryPerformanceSpecification

------------------------------------------------------------------------------
  == Internal Changes ==
  
  [[Anchor(Grunt_parser_(Phase_1))]]
- ==== Grunt parser (Phase 1) ====
+ ==== GruntParser/PigServer (Phase 1) ====
- The parser currently uses a bottom up approach. When it sees a store (dump, 
explain), it goes bottom up and generates the plan that needs to happen for 
this particular store. In order to optimize the multi-query example, we need, 
however, a peek on the entire graph for a script (interactive mode can be 
handled differently).
+ The parser currently uses a bottom up approach. When it sees a store (dump, 
explain), it goes bottom up and generates the plan that needs to happen for 
this particular store. In order to optimize the multi-query example, we need, 
however, a peek on the entire graph for a script (interactive mode is handled 
differently).
  
- In order to do this we will change the batch mode of the parser to:
+ The high-level changes to the parser/server are:
  
     * Not execute the plan when we see a store or dump
     * Alter the already existing merge functionality to allow intersecting 
graphs to be joined into a single logical plan.
@@ -520, +520 @@

  
  The new "run" command will simply feed all the lines of a script through the 
interactive mode.
  
+ The PigServer has a new interface:
+    * setBatchOn()
+       * By default batch mode is off and we are in interactive mode. 
setBatchOn starts a new batch for execution that will not execute on store.
+       * setBatchOn can be called multiple times and will produce a nested set 
of batches
+    * executeBatch()
+       * Whenever batch mode is on, executeBatch will process all the stores that 
are currently in the batch and have not been processed before
+    * discardBatch()
+       * Removes the current batch and goes back to the previous batch (or 
interactive mode, if there are no more)
+    * isBatchOn()
+       * Tells whether batch mode is on
+    * isBatchEmpty()
+       * Helper function that tells whether there are any unprocessed stores 
in the current batch
+ 
+ Internally all different batches and the interactive mode are represented as 
Graph objects. A graph object maintains a cache of the registered queries, 
keeps track of the logical plan and the processed/unprocessed stores. Graphs 
themselves can be interactive or batch. Interactive graphs will execute on 
store, batch graphs won't.
+ 
+ The PigServer maintains a stack of these graph objects so that 
setBatchOn/discardBatch operations basically become push and pop operations.
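
The push/pop behavior described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the PigServer implementation: the class BatchStackSketch, its nested Graph class, the store() method and the executed list are hypothetical names introduced for illustration. Only the method names setBatchOn, executeBatch, discardBatch, isBatchOn and isBatchEmpty come from the interface listed above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of the graph stack; hypothetical, not the actual PigServer code. */
final class BatchStackSketch {

    /** Hypothetical stand-in for the Graph object: tracks stores and its mode. */
    static final class Graph {
        final boolean batchMode;
        final List<String> unprocessedStores = new ArrayList<>();
        final List<String> processedStores = new ArrayList<>();

        Graph(boolean batchMode) {
            this.batchMode = batchMode;
        }
    }

    // Graphs that were suspended by setBatchOn(), innermost on top.
    private final Deque<Graph> suspended = new ArrayDeque<>();
    // The graph currently receiving statements; starts out interactive.
    private Graph current = new Graph(false);
    // Records which stores were "executed", in order (stands in for real execution).
    final List<String> executed = new ArrayList<>();

    /** Push: suspend the current graph and start a new (possibly nested) batch. */
    void setBatchOn() {
        suspended.push(current);
        current = new Graph(true);
    }

    /** Pop: drop the current batch and return to the previous graph, if any. */
    void discardBatch() {
        if (!suspended.isEmpty()) {
            current = suspended.pop();
        }
    }

    boolean isBatchOn() {
        return current.batchMode;
    }

    boolean isBatchEmpty() {
        return current.unprocessedStores.isEmpty();
    }

    /** Register a store; interactive graphs execute immediately, batch graphs wait. */
    void store(String target) {
        current.unprocessedStores.add(target);
        if (!current.batchMode) {
            runPending();
        }
    }

    /** Process every store in the current batch that has not been processed yet. */
    void executeBatch() {
        if (current.batchMode) {
            runPending();
        }
    }

    private void runPending() {
        executed.addAll(current.unprocessedStores);
        current.processedStores.addAll(current.unprocessedStores);
        current.unprocessedStores.clear();
    }
}
```

In this model a nested script would call setBatchOn() on entry and discardBatch() on exit, which leaves the enclosing batch untouched.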
+ 
+ If the multi-query optimization is turned off all graphs will be generated as 
interactive, which is how we revert the behavior.
+ 
+ The merging of the different logical plans is done in the OperatorPlan, where 
merges can now either check for disjoint graphs or merge them with overlaps.
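
The two merge modes can be sketched as follows. This is a minimal illustration and not the OperatorPlan code: the class PlanMergeSketch, its methods, and the choice to identify operators by a plain string name are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical plan model: each operator name maps to its successor operators. */
final class PlanMergeSketch {
    final Map<String, Set<String>> edges = new HashMap<>();

    void addOperator(String op) {
        edges.computeIfAbsent(op, k -> new HashSet<>());
    }

    void connect(String from, String to) {
        addOperator(from);
        addOperator(to);
        edges.get(from).add(to);
    }

    /**
     * Merge another plan into this one.
     * With allowOverlap == false the plans must be disjoint, otherwise we fail
     * before changing anything. With allowOverlap == true, shared operators are
     * unified and their successor sets are combined.
     */
    void merge(PlanMergeSketch other, boolean allowOverlap) {
        if (!allowOverlap) {
            for (String op : other.edges.keySet()) {
                if (edges.containsKey(op)) {
                    throw new IllegalArgumentException("plans are not disjoint: " + op);
                }
            }
        }
        for (Map.Entry<String, Set<String>> e : other.edges.entrySet()) {
            edges.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
        }
    }
}
```

Merging two plans that share a load operator with overlap allowed yields a single plan in which that load feeds both downstream branches, which is the shape the multi-query optimization needs.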
+ 
+ Finally, the store-load handling is done in the pig server. It will either 
transform the plan or add a store-load connection. Absolute filenames will be 
available, since the QueryParser now translates them when it sees them.
+ 
+ The grunt parser makes use of the new PigServer APIs. It will use the batch 
API to parse and execute scripts. Since scripts can be nested, the script 
execution and the stack of graphs in the PigServer are closely related.
+ 
  [[Anchor(Explain_(Phase_1))]]
  ==== Explain (Phase 1) ====
  
  As described above the changes are:
  
-    * Add options to explain and illustrate to work on a script file as well 
as a handle.
+    * Add options to explain to work on a script file as well as a handle.
-    * Add the ability to print plans as dot files and to write explain and 
illustrate output to files.
+    * Add the ability to print plans as dot files and to write output to files.
+    * Allow for a "brief" option to control verbosity.
  
- There will be some work to nicely represent the graphs resulting from explain 
in text form. Right now operators with multiple outputs will result in the 
ancestor tree be duplicated for each output. It might be nicer to show the 
ancestors once and mark the other places as copies of that one.
+ The explain command already supports writing to PrintStreams, so adding the 
capability of writing to files was fairly straightforward.
+ 
+ Parameters for the desired output format as well as the verbosity were added 
to the explain calls in the execution engine, logical and physical plan.
+ 
+ The dot output is realized by the DotPlanDumper. Since dot does not care 
about any specific order, we simply iterate over the nodes and edges and dump 
them. Nested plans are realized as subgraphs which will cause a recursion in the 
DotPlanDumper. Printing a map reduce plan will first start with the MROperator 
plan and then recurse into the physical plans.
+ 
+ Invisible nodes and edges are used to force the right layout of subgraphs. 
There is one invisible input and one invisible output per subgraph that are 
connected (via invisible edges) to the roots and leaves of the nested plan 
respectively. This way top to bottom layout of the nested graph is ensured.
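
The invisible input/output trick can be sketched as below. This is an illustration only, not the DotPlanDumper code: the class DotSubgraphSketch, the method nestedPlan and the node naming scheme (id + "_in", id + "_out") are hypothetical. The dot features used (a subgraph whose name starts with "cluster", and the style=invis attribute on nodes and edges) are standard Graphviz.

```java
import java.util.List;

/** Hypothetical helper that emits one nested plan as a dot cluster subgraph. */
final class DotSubgraphSketch {

    /**
     * @param id     identifier used to name the cluster and its invisible nodes
     * @param roots  node names of the nested plan's roots
     * @param leaves node names of the nested plan's leaves
     * @param edges  visible edges of the nested plan as {from, to} pairs
     * @return a dot fragment to be embedded inside an enclosing digraph
     */
    static String nestedPlan(String id, List<String> roots, List<String> leaves,
                             List<String[]> edges) {
        String in = id + "_in";
        String out = id + "_out";
        StringBuilder sb = new StringBuilder();
        sb.append("subgraph cluster_").append(id).append(" {\n");
        // One invisible input and one invisible output node per subgraph.
        sb.append("  ").append(in).append(" [style=invis];\n");
        sb.append("  ").append(out).append(" [style=invis];\n");
        // Invisible edges pin the roots below the input node...
        for (String root : roots) {
            sb.append("  ").append(in).append(" -> ").append(root)
              .append(" [style=invis];\n");
        }
        // ...the real edges are dumped in whatever order they come...
        for (String[] edge : edges) {
            sb.append("  ").append(edge[0]).append(" -> ").append(edge[1]).append(";\n");
        }
        // ...and invisible edges pin the output node below the leaves.
        for (String leaf : leaves) {
            sb.append("  ").append(leaf).append(" -> ").append(out)
              .append(" [style=invis];\n");
        }
        sb.append("}\n");
        return sb.toString();
    }
}
```

Because dot ranks nodes along edges, the invisible edges are what keep the roots at the top and the leaves at the bottom of the cluster, even though neither the extra nodes nor the extra edges are drawn.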
+ 
+ CoGroups are special in that their nested plans are connected to input 
operators. This is modeled as subgraphs as well and the subgraphs are connected 
to the respective input operator.
  
  [[Anchor(Local_Execution_engine_(Phase_1))]]
  ==== Local Execution engine (Phase 1) ====
