Hi Yan, I agree that the first scenario (filter logic applied to individual input sources) doesn't need conversion to CNF and that it will be a good idea to add CNF functionality for the second scenario. I was also planning to provide a configurable threshold value to control the complexity of CNF conversion.
As part of the above, I wrote a utility to push the "NOT" operator in predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am considering making this utility to push NOT a separate rule in itself. Lmk if you have already implemented this. While implementing this utility I am facing some trouble in maintaining OperatorPlan consistent as I rewrite the expression. This is because each operator is referencing the main filter logical plan. Here is my current approach of implementation: 1. I am creating a new LogicalExpressionPlan for the converted boolean expression. 2. I am creating new logical expressions while pushing the NOT operation, converting AND into OR, OR into AND eliminating NOT NOT pairs. 3. However, I am having trouble updating the LogicalExpressionPlan if it reaches the base case ( i.e. root operator is not NOT,AND,OR). D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) ); In the above, for example, I am not sure how to integrate base expression (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the plan for a given operator and its children. Also, there is currently no way to deepCopy an expression into a new OperatorPlan. It would be great if you could give me some suggestions on what approach to take for this. One approach I thought of is to visit the base expression and create and connect the base expression to the LogicalExpressionPlan as I visit it. Thoughts? Swati ps: About your other point regarding binary vs multi way trees, the way I am creating the normal form is a list of conjuncts, where each conjunct is a list of disjuncts. This is logically similar to a multi waytree. However, the current modeling of boolean expressions (modeled as binary expressions) requires a conversion back to the binary tree model when adding back to the main plan. On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <y...@yahoo-inc.com> wrote: > Swati, > > I happen to be working on the logical expression simplification effort > (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter > split front. So I guess our interests will have some overlaps. > > I think the filter logic split problem can be divided into 2 parts: > 1) the filtering logic that can be applied to individual input sources; > and 2) the filtering logic that has to be applied when merged(or joined) > inputs are processed. > > The benefits for 1) are any benefits if the underlying storage supports > predicate pushdown; plus the memory/CPU savings by PIG for not > processing the unqualified rows. > > For 2), the purpose is not paying higher evaluation costs than > necessary. > > For 1), no normal form is necessary. The original logical expression > tree > can be trimmed off any sub-expressions that are not constants nor just > from a particular input source. The complexity is linear with the tree > size; while the use of normal form could potentially lead to exponential > complexity. The difficulty with this approach is how to generate the > filtering logic for 2); while CNF can be used to easily figure out the > logic for 2). However, the exact logic in 2) might not be cheaper to > evaluate than the original logical expression. An example is "Filter J2 > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost > will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced > to 4, compared with 3 logical evaluations in the original form. > > In summary, if only 1) is desired, the tree trimming is enough. If 2) is > desired too, then CNF could be used but its complexity should be > controlled and the cost of the filtering logic evaluation in 2) should > be computed and compared with the original expression evaluation cost. > Further optimization is possible in this direction. > > Another potential optimization to consider is to support logical > expression tree of multiple children vs. the binary tree after taking > into consideration of the commutative property of OR and AND operations. > The advantages are less tree traversal costs and easier to change the > evaluation ordering within the same sub-tree in order to maximize the > possibilities to short-cut the evaluations. Although this is general for > all logical expressions, this tends to be more suitable for normal form > handlings as normal forms group the sub-expressions by the operators > that act on the sub-expressions. > > Thanks, > > Yan > > -----Original Message----- > From: Swati Jain [mailto:swat...@aggiemail.usu.edu] > Sent: Monday, July 05, 2010 2:34 AM > To: pig-dev@hadoop.apache.org > Cc: Daniel Dai > Subject: PIG Logical Optimization: Use CNF in SplitFilter > > Hi, > > I am interested in implementing logical optimization rules and to target > this I have studied currently implemented logical rules and the rule > framework. In particular, I felt that rules dealing with LOfilter are > not > able to handle complicated boolean expressions. I would like to share > suggestions to improve handling of boolean expressions in LOFilter to > enable > better optimization. > > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into > two by > "AND". However it will not be able to split LOFilter if the top level > operator is "OR". For example: > > *ex script:* > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); > J1 = JOIN B by b1, C by c1; > J2 = JOIN J1 by $0, A by a1; > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);* > explain D; > > In the above example current rule is not able to any filter condition > across > any join as it contains columns from all branches (inputs). But if we > convert this expression into "Conjunctive Normal Form" (CNF) then we > would > be able to push filter condition c1< 10 and c2 == 5 below both join > conditions. Here is the CNF expression for highlighted line: > > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) ) > > *Suggestions:* It would be a good idea to convert LOFilter's boolean > expression into CNF, it would then be easy to push parts (conjuncts) of > the > LOFilter boolean expression selectively. > > I have started thinking about the design for implementing this > conversion > (arbitrary boolean expression to CNF) and would appreciate any feedback > or > ideas. > > Thanks! > Swati >