The following page has been changed by PiSong:
http://wiki.apache.org/pig/NestedLogicalPlan

------------------------------------------------------------------------------
  = Nested Logical Plan Model =
- {!!!First draft!!!}
+ {!!!draft 0.2 !!!}
+
+ == Preface ==
+ At a minimum, I expect this page to help us better understand the Pig language and its logical model conceptually. Some concepts may be useful for implementation; others may be too far from practical application.

  == Benefits ==
   * Support n level nested queries.
   * Almost all operators are supported in all levels.
   * Allow the language to be more flexible
   * Consistent and logical design. Easier to implement additional features in the future.
+
+ == What is Logical Plan? ==
+
+ A logical plan is an isomorph of a Pig query. It captures all the information needed for logical execution (an abstract execution performed sequentially by logical reasoning). Any implementation of an actual physical plan and physical execution engine always yields the same result as logical execution, regardless of the underlying backend or optimization techniques.
+
+ For example, one can think about it this way:-
+
+ 3 * (4 + 5)
+
+ The logical execution of this plan should always yield 27 regardless of whether we use a calculator or a Macbook to process it, or whether we distribute 3 over the addition or not.
+
+ In practice, a logical plan can be associated with additional information/flags that dictate some of the physical execution's behavior (e.g. degree of parallelism or optimization hints). This has to be considered carefully to keep the language generic.
+
+ Also, from the definition given above, we will need a reference execution engine implementation when we support more backends.

  == What is Load/Store? ==
  ==== In current implementation ====
@@ -32, +49 @@
  || StoreField || Field || Tuple ||

   1. This makes it possible to create a consistent nested processing model.
+  1. 
This gives more flexibility to the language beyond plans consisting only of bag-based operators.
+
+
+ == What is Pig Query? ==
+
+ From the query:-
+
+ {{{
+ A = LOAD '/tmp/data1.txt' ;
+ B = COGROUP A BY $0*$1 ;
+ C = FILTER B BY $0 > 5 ;
+ }}}
+
+ Functionally:-
+ {{{
+ A : File -> Bag
+ B : Bag x (f: Tuple -> Tuple) -> Bag
+ C : Bag x (f: Tuple -> Boolean) -> Bag
+ }}}
+
+ By
+  1. Neglecting "A", which we consider the bridge between the file system and the Pig processing space
+  1. Focusing only on the main components of the data flow
+ We can write:-
+ {{{
+ (C o B) : Bag -> Bag
+ }}}
+ We can say "A Pig query is a function from Bag to Bag" :-
+ {{{
+ f: Bag -> Bag        This is our current logical plan.
+ }}}
+
+ By generalizing it, we should be able to say:-
+ {{{
+ f: Datum -> Datum    This is the generalized logical plan.
+ }}}
+
+ This abstraction indicates that a Pig query (and the logical plan as its isomorph) should be able to process from any data type to any data type.
+
+
+ == What is nested plan? ==
+
+ From the query:-
+
+ {{{
+ A = LOAD '/tmp/data1.txt' ;
+ B = COGROUP A BY $0*$1 ;
+ C = FILTER B BY $0 > 5 ;
+ }}}
+ Considering the COGROUP operator, these are the operations that have to be performed sequentially:-
+  1. Iterate through the bag from the input port.
+  1. For each tuple in the bag, calculate $0 * $1, append it as an element in the tuple, and tag it as the field being used for grouping.
+  1. Once (2) is complete for the whole bag, start grouping by the tagged field.
+  1. Output data to the output port.
+
+ This is the generalized version:-
+  1. Iterate through the bag from the input port.
+  1. For each tuple in the bag, apply f: Tuple -> Tuple, append the output tuple as an element in the tuple, and tag it as the field being used for grouping.
+  1. Once (2) is complete for the whole bag, start grouping by the tagged field.
+  1. Output data to the output port.
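
The generalized COGROUP steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Pig's implementation: the names `cogroup`, `filter_bag` and `inner_plan` are hypothetical, a bag is modelled as a Python list, a tuple as a Python tuple, and the group key is assumed to be kept as the first field of each output tuple.

```python
from collections import defaultdict
from typing import Callable, Iterable, List


def cogroup(bag: Iterable[tuple], inner_plan: Callable[[tuple], tuple]) -> List[tuple]:
    """Group a bag by the result of an inner plan f: Tuple -> Tuple (hypothetical helper)."""
    # Steps 1-2: iterate through the bag and tag every tuple with the
    # output of the inner plan, which becomes the grouping field.
    tagged = [(inner_plan(t), t) for t in bag]

    # Step 3: only after the whole bag is tagged, group by the tagged field.
    groups = defaultdict(list)
    for key, t in tagged:
        groups[key].append(t)

    # Step 4: emit one (key, bag-of-members) tuple per group.
    return [(key, members) for key, members in groups.items()]


def filter_bag(bag: Iterable[tuple], predicate: Callable[[tuple], bool]) -> List[tuple]:
    """FILTER with an inner plan f: Tuple -> Boolean (hypothetical helper)."""
    return [t for t in bag if predicate(t)]


# B = COGROUP A BY $0*$1 ;  C = FILTER B BY $0 > 5 ;
A = [(1, 2, "a"), (2, 1, "b"), (3, 4, "c")]
B = cogroup(A, lambda t: (t[0] * t[1],))
C = filter_bag(B, lambda g: g[0][0] > 5)
```

Both operators take their inner plan as an ordinary function argument, which is the sense in which the per-tuple expression is itself a plan from one datum to another.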
+
+ From the generalized logical plan, which is:-
+ {{{
+ f: Datum -> Datum
+ }}}
+
+ This indicates that __"f: Tuple -> Tuple in the COGROUP operator is also a logical plan"__. In the actual implementation, this means we need an inner logical plan for the COGROUP operator.
+
+ == Applications ==
+
-  1. This gives more flexibility to the language beyond plans only consisting of bag-based operators. For example, if I have implemented a UDF called TupleMinMax of type Tuple -> Tuple that does find MIN/MAX across all the elements. One way to try this out in Grunt would be:-
+ For example, suppose I have implemented a UDF called TupleMinMax of type Tuple -> Tuple that finds the MIN/MAX across all the elements. One way to try this out in Grunt would be:-
  {{{
  A = <1,2,3,4,5,6,7,8,9,10> ;
  B = TupleMinMax(A) ;
@@ -271, +358 @@
  ==== LOProject ====
  This operator is only for mapping input tuple to output tuple (eg. {A,B,C,D,E} ==> {A,C,D} ). Given the fact that we allow users to have fields in COGROUP, FILTER, FOREACH as expressions, LOProject then becomes just a special case when users merely specify direct mapping. Since we have agreed upon the concept of inner plans, I think LOProject is not needed.
+
+ [shrav] Project is a consistent way of implementing the fields the user mentions, without making the user worry about the conversions he might need to do if we just passed the raw tuple to him. Also, you can only project out one field, not multiple fields.
+
+ [pi] What you mentioned here is different from the current implementation.