Yes, I already implemented the "NOT push down" upfront, so you do not need to do that.
The support of CNF will probably be the most difficulty part. But as I mentioned last time, you should compare the cost after the trimming CNF to get the post-split filtering logic. Given the complexity of manipulating CNF and undetermined benefits, I am not sure it should be in scope at this moment or not. To handle CNF, I think it's a good idea to create a new plan and connect the nodes in the new plan to the base plan as you envisioned. In my changes, which uses DNF instead of CNF but processing is similar otherwise, I use a LogicalExpressionProxy, which contains a "source" member that is just the node in the original plan, to link the nodes in the new plan and old plan. The original LogicalExpression is enhanced with a counter to trace the # of proxies of the original nodes since normal form creation will "spread" the nodes in the original tree across many normalized nodes. The benefit, aside from not setting the plan, is that the original expression is trimmed according to the processing results from DNF; while DNF is created separately and as a kinda utility so that complex features can be used. In my changes, I used multiple-child tree in DNF while not changing the original binary expression tree structure. Another benefit is that the original tree is kept as much as it is at the start, i.e., I do not attempt to optimize its overall structure beyond trimming based upon the simplification logics. (I also control the size of DNF to 100 nodes.) The down side of this is added complexity. But in your case, for scenario 2 which is the whole point to use CNF, you would need to change the original expression tree structurally beyond trimming for post-split filtering logic. The other benefit of using multiple-child expression is depending upon if you plan to support such expression to replace current binary tree in the final plan. Even though I think it's a good idea to support that, but it is not in my scope now. I'll add my algorithm details soon to my jira. Please take a look and comment as you see appropriate. Thanks, Yan ________________________________ From: Swati Jain [mailto:swat...@aggiemail.usu.edu] Sent: Friday, July 09, 2010 11:00 PM To: Yan Zhou Cc: pig-dev@hadoop.apache.org Subject: Re: PIG Logical Optimization: Use CNF in SplitFilter Hi Yan, I agree that the first scenario (filter logic applied to individual input sources) doesn't need conversion to CNF and that it will be a good idea to add CNF functionality for the second scenario. I was also planning to provide a configurable threshold value to control the complexity of CNF conversion. As part of the above, I wrote a utility to push the "NOT" operator in predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am considering making this utility to push NOT a separate rule in itself. Lmk if you have already implemented this. While implementing this utility I am facing some trouble in maintaining OperatorPlan consistent as I rewrite the expression. This is because each operator is referencing the main filter logical plan. Here is my current approach of implementation: 1. I am creating a new LogicalExpressionPlan for the converted boolean expression. 2. I am creating new logical expressions while pushing the NOT operation, converting AND into OR, OR into AND eliminating NOT NOT pairs. 3. However, I am having trouble updating the LogicalExpressionPlan if it reaches the base case ( i.e. root operator is not NOT,AND,OR). D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) ); In the above, for example, I am not sure how to integrate base expression (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the plan for a given operator and its children. Also, there is currently no way to deepCopy an expression into a new OperatorPlan. It would be great if you could give me some suggestions on what approach to take for this. One approach I thought of is to visit the base expression and create and connect the base expression to the LogicalExpressionPlan as I visit it. Thoughts? Swati ps: About your other point regarding binary vs multi way trees, the way I am creating the normal form is a list of conjuncts, where each conjunct is a list of disjuncts. This is logically similar to a multi waytree. However, the current modeling of boolean expressions (modeled as binary expressions) requires a conversion back to the binary tree model when adding back to the main plan. On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <y...@yahoo-inc.com> wrote: Swati, I happen to be working on the logical expression simplification effort (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter split front. So I guess our interests will have some overlaps. I think the filter logic split problem can be divided into 2 parts: 1) the filtering logic that can be applied to individual input sources; and 2) the filtering logic that has to be applied when merged(or joined) inputs are processed. The benefits for 1) are any benefits if the underlying storage supports predicate pushdown; plus the memory/CPU savings by PIG for not processing the unqualified rows. For 2), the purpose is not paying higher evaluation costs than necessary. For 1), no normal form is necessary. The original logical expression tree can be trimmed off any sub-expressions that are not constants nor just from a particular input source. The complexity is linear with the tree size; while the use of normal form could potentially lead to exponential complexity. The difficulty with this approach is how to generate the filtering logic for 2); while CNF can be used to easily figure out the logic for 2). However, the exact logic in 2) might not be cheaper to evaluate than the original logical expression. An example is "Filter J2 by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced to 4, compared with 3 logical evaluations in the original form. In summary, if only 1) is desired, the tree trimming is enough. If 2) is desired too, then CNF could be used but its complexity should be controlled and the cost of the filtering logic evaluation in 2) should be computed and compared with the original expression evaluation cost. Further optimization is possible in this direction. Another potential optimization to consider is to support logical expression tree of multiple children vs. the binary tree after taking into consideration of the commutative property of OR and AND operations. The advantages are less tree traversal costs and easier to change the evaluation ordering within the same sub-tree in order to maximize the possibilities to short-cut the evaluations. Although this is general for all logical expressions, this tends to be more suitable for normal form handlings as normal forms group the sub-expressions by the operators that act on the sub-expressions. Thanks, Yan -----Original Message----- From: Swati Jain [mailto:swat...@aggiemail.usu.edu] Sent: Monday, July 05, 2010 2:34 AM To: pig-dev@hadoop.apache.org Cc: Daniel Dai Subject: PIG Logical Optimization: Use CNF in SplitFilter Hi, I am interested in implementing logical optimization rules and to target this I have studied currently implemented logical rules and the rule framework. In particular, I felt that rules dealing with LOfilter are not able to handle complicated boolean expressions. I would like to share suggestions to improve handling of boolean expressions in LOFilter to enable better optimization. 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into two by "AND". However it will not be able to split LOFilter if the top level operator is "OR". For example: *ex script:* A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); J1 = JOIN B by b1, C by c1; J2 = JOIN J1 by $0, A by a1; D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);* explain D; In the above example current rule is not able to any filter condition across any join as it contains columns from all branches (inputs). But if we convert this expression into "Conjunctive Normal Form" (CNF) then we would be able to push filter condition c1< 10 and c2 == 5 below both join conditions. Here is the CNF expression for highlighted line: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) ) *Suggestions:* It would be a good idea to convert LOFilter's boolean expression into CNF, it would then be easy to push parts (conjuncts) of the LOFilter boolean expression selectively. I have started thinking about the design for implementing this conversion (arbitrary boolean expression to CNF) and would appreciate any feedback or ideas. Thanks! Swati