[Pig Wiki] Update of "NestedLogicalPlan" by PiSong

Apache Wiki Tue, 06 May 2008 06:28:12 -0700


The following page has been changed by PiSong:
http://wiki.apache.org/pig/NestedLogicalPlan

------------------------------------------------------------------------------
  = Nested Logical Plan Model =
- {!!!First draft!!!}
+ {!!!draft 0.2 !!!}
+ 
+ == Preface ==
+ At the very least, I expect this page to help us understand the Pig language 
and its logical model conceptually. Some of the concepts may be useful for 
implementation; others may be too far removed from practical use.
  
  == Benefits ==
   * Support n level nested queries. 
   * Almost all operators are supported in all levels.
   * Allow the language to be more flexible 
   * Consistent and logical design. Easier to implement additional features in 
the future. 
+ 
+ == What is Logical Plan? ==
+ 
+ A logical plan is an isomorph of a Pig query. It captures all the information 
needed for logical execution (an abstract execution performed sequentially by 
logical reasoning). Any implementation of an actual physical plan and physical 
execution engine always yields the same result as logical execution, regardless 
of the underlying backend or optimization techniques.
+   
+ For example, one can think about it this way:-
+ 
+ 3 * (4 + 5)
+ 
+ The logical execution of this plan should always yield 27 regardless of 
whether we use a calculator or a Macbook to process it, or whether we 
distribute 3 over addition or not.
+ 
+ In practice, a logical plan can carry additional information or flags that 
dictate some aspects of physical execution (e.g. degree of parallelism or 
optimization hints). This has to be considered carefully to keep the language 
generic.
+ 
+ It also follows from the definition above that we will need a reference 
execution engine implementation once we support more backends.
  
  == What is Load/Store? ==
  ==== In current implementation ====
@@ -32, +49 @@

  || StoreField || Field || Tuple ||
  
   1. This makes it possible to create a consistent nested processing model.
+  1. This makes the language more flexible, since plans are no longer limited 
to bag-based operators. 
+ 
+ 
+ == What is Pig Query? ==
+ 
+ From the query:-
+ 
+ {{{
+ A = LOAD '/tmp/data1.txt' ;
+ B = COGROUP A BY $0*$1 ;
+ C = FILTER B BY $0 > 5 ;
+ }}}
+ 
+ Functionally:-
+ {{{
+ A : File -> Bag
+ B : Bag x (f: Tuple -> Tuple) -> Bag
+ C : Bag x (f: Tuple -> Boolean) -> Bag
+ }}}
+ 
+ By 
+  1. Neglecting "A", which we consider to be the bridge between the file 
system and the Pig processing space
+  1. Focusing only on the main components of the data flow 
+ we can write:-
+ {{{
+ (C o B) : Bag -> Bag
+ }}}
+ We can say "A Pig query is a function from Bag to Bag" :-
+ {{{
+ f: Bag -> Bag         This is our current logical plan.
+ }}}
+ 
+ By generalizing it, we should be able to say:-
+ {{{
+ f: Datum -> Datum     This is generalized logical plan
+ }}}
+ 
+ This abstraction indicates that a Pig query (and the logical plan as its 
isomorph) should be able to map any data type to any data type.
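The "query as a function from Bag to Bag" view can be sketched in Python. This is only an illustration under stated assumptions: a Bag is modelled as a list of tuples, and `cogroup`, `filter_bag`, and `compose` are hypothetical helpers, not Pig APIs.

```python
from collections import defaultdict

# Assumption: a Bag is a Python list of tuples.

def cogroup(bag, key_fn):
    """B : Bag x (f: Tuple -> Tuple) -> Bag."""
    groups = defaultdict(list)
    for t in bag:
        groups[key_fn(t)].append(t)
    # Each output tuple is the key tuple followed by the bag of members,
    # so $0 of the output is the grouping value.
    return [key + (members,) for key, members in sorted(groups.items())]


def filter_bag(bag, predicate):
    """C : Bag x (f: Tuple -> Boolean) -> Bag."""
    return [t for t in bag if predicate(t)]


def compose(outer, inner):
    """(outer o inner) : Bag -> Bag."""
    return lambda bag: outer(inner(bag))


B = lambda bag: cogroup(bag, lambda t: (t[0] * t[1],))   # COGROUP A BY $0*$1
C = lambda bag: filter_bag(bag, lambda t: t[0] > 5)      # FILTER B BY $0 > 5
query = compose(C, B)                                    # (C o B)

A = [(1, 2), (2, 3), (3, 2), (4, 5)]   # stands in for the loaded file
result = query(A)
```

With this sample bag, `result` holds the groups whose key exceeds 5, i.e. the groups for 6 and 20. Replacing the list-of-tuples type with any other datum type is what the generalization to `f: Datum -> Datum` amounts to.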
+ 
+ 
+ == What is nested plan? ==
+ 
+ From the query:-
+ 
+ {{{
+ A = LOAD '/tmp/data1.txt' ;
+ B = COGROUP A BY $0*$1 ;
+ C = FILTER B BY $0 > 5 ;
+ }}}
+ Considering the COGROUP operator, these are the operations that have to be 
performed sequentially:-
+  1. Iterate through the bag from the input port
+  1. For each tuple in the bag, calculate $0 * $1, append it as an element in 
the tuple, and tag it as the field used for grouping.
+  1. Once (2) is complete for the whole bag, start grouping by the tagged 
field.
+  1. Output data to the output port.
+ 
+ This is the generalized version:-
+  1. Iterate through the bag from the input port
+  1. For each tuple in the bag, apply f: Tuple -> Tuple, append the output 
tuple as an element in the tuple, and tag it as the field used for grouping.
+  1. Once (2) is complete for the whole bag, start grouping by the tagged 
field.
+  1. Output data to the output port.
+ 
+ From the generalized logical plan which is:-
+ {{{
+ f: Datum -> Datum
+ }}}
+ 
+ Because a Tuple is a kind of Datum, this indicates that __"f: Tuple -> Tuple 
in the COGROUP operator is also a logical plan"__. In implementation terms, the 
COGROUP operator needs an inner logical plan.
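The generalized COGROUP steps can be sketched in Python as an operator that takes its inner plan as a parameter. This is a minimal sketch, not the Pig implementation: bags are assumed to be lists of tuples, and `cogroup_with_inner_plan` and `product_of_first_two` are hypothetical names.

```python
from collections import defaultdict

def cogroup_with_inner_plan(bag, inner_plan):
    """COGROUP parameterized by an inner plan f: Tuple -> Tuple."""
    # Steps 1-2: iterate over the input bag, apply the inner plan to each
    # tuple, and append its output as a last element (the "tagged" field).
    tagged = [t + (inner_plan(t),) for t in bag]

    # Step 3: only once the whole bag is tagged, group by the tagged field.
    groups = defaultdict(list)
    for t in tagged:
        groups[t[-1]].append(t[:-1])   # strip the tag from the stored tuple

    # Step 4: output one (key tuple, bag of members) tuple per group,
    # in order of first appearance.
    return [(key, members) for key, members in groups.items()]


def product_of_first_two(t):
    """Inner plan for 'BY $0*$1': itself a function Tuple -> Tuple."""
    return (t[0] * t[1],)


bag = [(1, 2), (2, 1), (3, 4)]
grouped = cogroup_with_inner_plan(bag, product_of_first_two)
```

Because `inner_plan` is just another function from datum to datum, any plan of the right type can be plugged in, including one that is itself built from smaller plans.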
+ 
+ == Applications ==
+ 
-  1. This gives more flexibility to the language beyond plans only consisting 
of bag-based operators. For example, if I have implemented a UDF called 
TupleMinMax of type Tuple -> Tuple that does find MIN/MAX across all the 
elements. One way to try this out in Grunt would be:-
+ For example, suppose I have implemented a UDF called TupleMinMax of type 
Tuple -> Tuple that finds the MIN/MAX across all the elements of a tuple. One 
way to try this out in Grunt would be:-
  {{{
  A = <1,2,3,4,5,6,7,8,9,10> ;
  B = TupleMinMax(A) ;
@@ -271, +358 @@

  
  ==== LOProject ====
  This operator only maps an input tuple to an output tuple (e.g. 
{A,B,C,D,E} ==> {A,C,D}). Given that we allow users to write fields in 
COGROUP, FILTER, and FOREACH as expressions, LOProject becomes just a special 
case in which users merely specify a direct mapping. Since we have agreed upon 
the concept of inner plans, I think LOProject is not needed.
+ 
  [shrav] Project is a consistent way of implementing the fields that the user 
mentions, without making the user deal with all the conversions that would be 
needed if we just passed the raw tuple. Also, you can only project out one 
field, not multiple fields.
+ [pi] What you mentioned here is different from the current implementation.
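The argument in the LOProject paragraph, that a projection is a special case of an expression-based inner plan, can be sketched in Python. The helpers `project` and `expression_plan` are hypothetical and do not correspond to any Pig class; the sketch does not settle the question raised in the comments about what the current implementation does.

```python
def project(*columns):
    """Build a Tuple -> Tuple function that copies the given columns as-is."""
    return lambda t: tuple(t[i] for i in columns)


def expression_plan(*exprs):
    """Build a Tuple -> Tuple function from arbitrary per-field expressions."""
    return lambda t: tuple(e(t) for e in exprs)


row = ("A", "B", "C", "D", "E")

# {A,B,C,D,E} ==> {A,C,D} written as a direct projection ...
direct = project(0, 2, 3)

# ... and the same mapping written as an inner plan whose expressions
# happen to be plain column references.
as_expressions = expression_plan(lambda t: t[0], lambda t: t[2], lambda t: t[3])

# An inner plan that is not a pure projection: it computes a new field.
computed = expression_plan(lambda t: t[0] + t[1])
```

In this sketch `direct(row)` and `as_expressions(row)` return the same tuple, while `computed` shows what the expression form can do that a pure projection cannot.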
