I was wondering if you are not going to check in your patch soon then it would be great if you could share it with me. I believe I might be able to reuse some of your (utility) functionality directly or get some ideas.
About your cost-benefit question: 1) I will control the complexity of CNF conversion by providing a configurable threshold value which will limit the OR-nesting. 2) One benefit of this conversion is that it will allow pushing parts of a filter (conjuncts) across the joins which is not happening in the current PushUpFilter optimization. Moreover, it may result in a cascading effect to push the conjuncts below other operators by other rules that may be fired as a result. The benefit from this is really data dependent, but in big-data workloads, any kind of predicate pushdown may eventually lead to big savings in amount of data read or amount of data transfered/shuffled across the network (I need to understand the LogicalPlan to PhysicalPlan conversion better to give concrete examples). Thanks! Swati On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <y...@yahoo-inc.com> wrote: > Yes, I already implemented the “NOT push down” upfront, so you do not > need to do that. > > > > The support of CNF will probably be the most difficulty part. But as I > mentioned last time, you should compare the cost after the trimming CNF to > get the post-split filtering logic. Given the complexity of manipulating CNF > and undetermined benefits, I am not sure it should be in scope at this > moment or not. > > > > To handle CNF, I think it’s a good idea to create a new plan and connect > the nodes in the new plan to the base plan as you envisioned. In my changes, > which uses DNF instead of CNF but processing is similar otherwise, I use a > LogicalExpressionProxy, which contains a “source” member that is just the > node in the original plan, to link the nodes in the new plan and old plan. > The original LogicalExpression is enhanced with a counter to trace the # of > proxies of the original nodes since normal form creation will “spread” the > nodes in the original tree across many normalized nodes. The benefit, aside > from not setting the plan, is that the original expression is trimmed > according to the processing results from DNF; while DNF is created > separately and as a kinda utility so that complex features can be used. In > my changes, I used multiple-child tree in DNF while not changing the > original binary expression tree structure. Another benefit is that the > original tree is kept as much as it is at the start, i.e., I do not attempt > to optimize its overall structure beyond trimming based upon the > simplification logics. (I also control the size of DNF to 100 nodes.) The > down side of this is added complexity. > > > > But in your case, for scenario 2 which is the whole point to use CNF, you > would need to change the original expression tree structurally beyond > trimming for post-split filtering logic. The other benefit of using > multiple-child expression is depending upon if you plan to support such > expression to replace current binary tree > > in the final plan. Even though I think it’s a good idea to support that, > but it is not in my scope now. > > > > I’ll add my algorithm details soon to my jira. Please take a look and > comment as you see appropriate. > > > > Thanks, > > > > Yan > > > > > ------------------------------ > > *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu] > *Sent:* Friday, July 09, 2010 11:00 PM > *To:* Yan Zhou > *Cc:* pig-dev@hadoop.apache.org > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter > > > > Hi Yan, > > I agree that the first scenario (filter logic applied to individual input > sources) doesn't need conversion to CNF and that it will be a good idea to > add CNF functionality for the second scenario. I was also planning to > provide a configurable threshold value to control the complexity of CNF > conversion. > > As part of the above, I wrote a utility to push the "NOT" operator in > predicates below the "AND" and "OR" operators (Scenario 2 in PIG-1399). I am > considering making this utility to push NOT a separate rule in itself. Lmk > if you have already implemented this. > > While implementing this utility I am facing some trouble in maintaining > OperatorPlan consistent as I rewrite the expression. This is because each > operator is referencing the main filter logical plan. Here is my current > approach of implementation: > > 1. I am creating a new LogicalExpressionPlan for the converted boolean > expression. > 2. I am creating new logical expressions while pushing the NOT operation, > converting AND into OR, OR into AND eliminating NOT NOT pairs. > 3. However, I am having trouble updating the LogicalExpressionPlan if it > reaches the base case ( i.e. root operator is not NOT,AND,OR). > > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) ); > > In the above, for example, I am not sure how to integrate base expression > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to set the > plan for a given operator and its children. Also, there is currently no way > to deepCopy an expression into a new OperatorPlan. It would be great if you > could give me some suggestions on what approach to take for this. > > One approach I thought of is to visit the base expression and create and > connect the base expression to the LogicalExpressionPlan as I visit it. > > Thoughts? > Swati > > ps: About your other point regarding binary vs multi way trees, the way I > am creating the normal form is a list of conjuncts, where each conjunct is a > list of disjuncts. This is logically similar to a multi waytree. However, > the current modeling of boolean expressions (modeled as binary expressions) > requires a conversion back to the binary tree model when adding back to the > main plan. > > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <y...@yahoo-inc.com> wrote: > > Swati, > > I happen to be working on the logical expression simplification effort > (https://issues.apache.org/jira/browse/PIG-1399), but not on the filter > split front. So I guess our interests will have some overlaps. > > I think the filter logic split problem can be divided into 2 parts: > 1) the filtering logic that can be applied to individual input sources; > and 2) the filtering logic that has to be applied when merged(or joined) > inputs are processed. > > The benefits for 1) are any benefits if the underlying storage supports > predicate pushdown; plus the memory/CPU savings by PIG for not > processing the unqualified rows. > > For 2), the purpose is not paying higher evaluation costs than > necessary. > > For 1), no normal form is necessary. The original logical expression > tree > can be trimmed off any sub-expressions that are not constants nor just > from a particular input source. The complexity is linear with the tree > size; while the use of normal form could potentially lead to exponential > complexity. The difficulty with this approach is how to generate the > filtering logic for 2); while CNF can be used to easily figure out the > logic for 2). However, the exact logic in 2) might not be cheaper to > evaluate than the original logical expression. An example is "Filter J2 > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost > will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced > to 4, compared with 3 logical evaluations in the original form. > > In summary, if only 1) is desired, the tree trimming is enough. If 2) is > desired too, then CNF could be used but its complexity should be > controlled and the cost of the filtering logic evaluation in 2) should > be computed and compared with the original expression evaluation cost. > Further optimization is possible in this direction. > > Another potential optimization to consider is to support logical > expression tree of multiple children vs. the binary tree after taking > into consideration of the commutative property of OR and AND operations. > The advantages are less tree traversal costs and easier to change the > evaluation ordering within the same sub-tree in order to maximize the > possibilities to short-cut the evaluations. Although this is general for > all logical expressions, this tends to be more suitable for normal form > handlings as normal forms group the sub-expressions by the operators > that act on the sub-expressions. > > Thanks, > > Yan > > > -----Original Message----- > From: Swati Jain [mailto:swat...@aggiemail.usu.edu] > Sent: Monday, July 05, 2010 2:34 AM > To: pig-dev@hadoop.apache.org > Cc: Daniel Dai > Subject: PIG Logical Optimization: Use CNF in SplitFilter > > Hi, > > I am interested in implementing logical optimization rules and to target > this I have studied currently implemented logical rules and the rule > framework. In particular, I felt that rules dealing with LOfilter are > not > able to handle complicated boolean expressions. I would like to share > suggestions to improve handling of boolean expressions in LOFilter to > enable > better optimization. > > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into > two by > "AND". However it will not be able to split LOFilter if the top level > operator is "OR". For example: > > *ex script:* > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); > J1 = JOIN B by b1, C by c1; > J2 = JOIN J1 by $0, A by a1; > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);* > explain D; > > In the above example current rule is not able to any filter condition > across > any join as it contains columns from all branches (inputs). But if we > convert this expression into "Conjunctive Normal Form" (CNF) then we > would > be able to push filter condition c1< 10 and c2 == 5 below both join > conditions. Here is the CNF expression for highlighted line: > > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) ) > > *Suggestions:* It would be a good idea to convert LOFilter's boolean > expression into CNF, it would then be easy to push parts (conjuncts) of > the > LOFilter boolean expression selectively. > > I have started thinking about the design for implementing this > conversion > (arbitrary boolean expression to CNF) and would appreciate any feedback > or > ideas. > > Thanks! > Swati > > >