Hi Yan Thanks for your prompt reply. I did not understand your statement “C1 and C2, or their equivalent, above JOIN can be easily figured out without resorting to CNF”.
Consider a LOFilter above a LOJoin. The predicate of LOFilter: ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5) The schema for LOJoin: A = (a1:int,a2:int,a3:int); B = (b1:int,b2:int,b3:int); C = (c1:int,c2:int,c3:int); After CNF: ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) ) Now we can push ( (c1 < 10) OR (c2 == 5) ) above the JOIN (in the branch leading up to the source C) while ( (a3+b3 > 10) OR (c2 ==5) ) stays put below the JOIN. Please let me know if there is a way of doing the above optimization without converting the original expression to CNF. Thanks, Swati On Mon, Jul 12, 2010 at 4:26 PM, Yan Zhou <y...@yahoo-inc.com> wrote: > I see. There looks like some disconnect about "Scenario 1". To me, all > filtering logics that can be pushed above JOIN can be figured out > without use of CNF, which is scenario 1; while CNF helps to derive the > filtering logic after (or, in your example, below) JOIN, which is > Scenario 2. > > > > In your example, C1 and C2, or their equivalent, above JOIN can be > easily figured out without resorting to CNF; C3 may have to be figured > out with CNF, but evaluation cost of the post-Join filtering logic thus > generated may not be cheaper than the original one before pushing up. > > > > In summary, if we want to support scenario 2(and 1), we should use CNF; > if we JUST want to support scenario 1, which will push up all possible > filters closer to source and have all benefits on pruned I/O, we should > not use CNF. > > > > Thanks, > > > > Yan > > > > -----Original Message----- > From: Swati Jain [mailto:swat...@aggiemail.usu.edu] > Sent: Monday, July 12, 2010 4:04 PM > To: pig-dev@hadoop.apache.org > Subject: PIG Logical Optimization: Use CNF in SplitFilter > > > > Yan, > > > > What I meant in my last email was that scenario 2 optimizations would > lead > > to more opportunities for scenario 1 kind of optimizations. > > > > Consider the conjunct list [C1;C2;C3] as the source of a JOIN. > > > > (a) Suppose none of these are computable on a join input, in this case > we > > retain the original expression and discard the CNF. > > > > (b) Suppose C1 is computable on join input J1 and C2 is computable on > join > > input J2 but C3 requires a combination of both join inputs. In this > case, we > > push C1 above J1, C2 above J2 and leave C3 as is below the JOIN. Note > that > > C1 and C2 may be further pushed up (with additional iterations of the > > optimizer). If they are now the source of single input operators, it is > > similar to scenario 1. > > > > Thanks, > > Swati > > > > > > On Mon, Jul 12, 2010 at 3:14 PM, Yan Zhou <y...@yahoo-inc.com> wrote: > > > > > Hopefully by this week. I'm still in the debugging phase of the work. > > > While you are welcome to reuse some of my algorithms, I doubt you can > reuse > > > the code as much as you want. It's basically for my DNF use. You might > need > > > to factor out some general codes which you can find > > > > > > reusable. > > > > > > > > > > > > I fully understand the I/O benefits as I put in my first message. And > it is > > > classified as "Scenario 1". There is no doubt that it should be > considered > > > as part of your work. However, for this, CNF is not necessary. > > > > > > > > > > > > For scenario 2, the benefits will be from less in-core logical > expression > > > evaluation costs and no I/O benefits as I can see. And use of CNF may > or may > > > not lead to cheaper evaluations as the example in my first message > shows. In > > > other words, after use of CNF, you should > > > > > > compare the eval cost with that in the original expression eval before > > > deciding either the CNF or the original form should be evaluated. > > > > > > > > > > > > Please let me know if I miss any of your points. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Yan > > > ------------------------------ > > > > > > *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu] > > > *Sent:* Monday, July 12, 2010 11:52 AM > > > > > > *To:* Yan Zhou > > > *Cc:* pig-dev@hadoop.apache.org > > > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter > > > > > > > > > > > > I was wondering if you are not going to check in your patch soon then > it > > > would be great if you could share it with me. I believe I might be > able to > > > reuse some of your (utility) functionality directly or get some ideas. > > > > > > About your cost-benefit question: > > > 1) I will control the complexity of CNF conversion by providing a > > > configurable threshold value which will limit the OR-nesting. > > > 2) One benefit of this conversion is that it will allow pushing parts > of a > > > filter (conjuncts) across the joins which is not happening in the > current > > > PushUpFilter optimization. Moreover, it may result in a cascading > effect to > > > push the conjuncts below other operators by other rules that may be > fired as > > > a result. The benefit from this is really data dependent, but in > big-data > > > workloads, any kind of predicate pushdown may eventually lead to big > savings > > > in amount of data read or amount of data transfered/shuffled across > the > > > network (I need to understand the LogicalPlan to PhysicalPlan > conversion > > > better to give concrete examples). > > > > > > Thanks! > > > Swati > > > > > > On Mon, Jul 12, 2010 at 10:36 AM, Yan Zhou <y...@yahoo-inc.com> wrote: > > > > > > Yes, I already implemented the "NOT push down" upfront, so you do not > need > > > to do that. > > > > > > > > > > > > The support of CNF will probably be the most difficulty part. But as I > > > mentioned last time, you should compare the cost after the trimming > CNF to > > > get the post-split filtering logic. Given the complexity of > manipulating CNF > > > and undetermined benefits, I am not sure it should be in scope at this > > > moment or not. > > > > > > > > > > > > To handle CNF, I think it's a good idea to create a new plan and > connect > > > the nodes in the new plan to the base plan as you envisioned. In my > changes, > > > which uses DNF instead of CNF but processing is similar otherwise, I > use a > > > LogicalExpressionProxy, which contains a "source" member that is just > the > > > node in the original plan, to link the nodes in the new plan and old > plan. > > > The original LogicalExpression is enhanced with a counter to trace > the # of > > > proxies of the original nodes since normal form creation will "spread" > the > > > nodes in the original tree across many normalized nodes. The benefit, > aside > > > from not setting the plan, is that the original expression is trimmed > > > according to the processing results from DNF; while DNF is created > > > separately and as a kinda utility so that complex features can be > used. In > > > my changes, I used multiple-child tree in DNF while not changing the > > > original binary expression tree structure. Another benefit is that the > > > original tree is kept as much as it is at the start, i.e., I do not > attempt > > > to optimize its overall structure beyond trimming based upon the > > > simplification logics. (I also control the size of DNF to 100 nodes.) > The > > > down side of this is added complexity. > > > > > > > > > > > > But in your case, for scenario 2 which is the whole point to use CNF, > you > > > would need to change the original expression tree structurally beyond > > > trimming for post-split filtering logic. The other benefit of using > > > multiple-child expression is depending upon if you plan to support > such > > > expression to replace current binary tree > > > > > > in the final plan. Even though I think it's a good idea to support > that, > > > but it is not in my scope now. > > > > > > > > > > > > I'll add my algorithm details soon to my jira. Please take a look and > > > comment as you see appropriate. > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Yan > > > > > > > > > > > > > > > ------------------------------ > > > > > > *From:* Swati Jain [mailto:swat...@aggiemail.usu.edu] > > > *Sent:* Friday, July 09, 2010 11:00 PM > > > *To:* Yan Zhou > > > *Cc:* pig-dev@hadoop.apache.org > > > *Subject:* Re: PIG Logical Optimization: Use CNF in SplitFilter > > > > > > > > > > > > Hi Yan, > > > > > > I agree that the first scenario (filter logic applied to individual > input > > > sources) doesn't need conversion to CNF and that it will be a good > idea to > > > add CNF functionality for the second scenario. I was also planning to > > > provide a configurable threshold value to control the complexity of > CNF > > > conversion. > > > > > > As part of the above, I wrote a utility to push the "NOT" operator in > > > predicates below the "AND" and "OR" operators (Scenario 2 in > PIG-1399). I am > > > considering making this utility to push NOT a separate rule in itself. > Lmk > > > if you have already implemented this. > > > > > > While implementing this utility I am facing some trouble in > maintaining > > > OperatorPlan consistent as I rewrite the expression. This is because > each > > > operator is referencing the main filter logical plan. Here is my > current > > > approach of implementation: > > > > > > 1. I am creating a new LogicalExpressionPlan for the converted boolean > > > expression. > > > 2. I am creating new logical expressions while pushing the NOT > operation, > > > converting AND into OR, OR into AND eliminating NOT NOT pairs. > > > 3. However, I am having trouble updating the LogicalExpressionPlan if > it > > > reaches the base case ( i.e. root operator is not NOT,AND,OR). > > > > > > D = Filter J2 by ( (c2 == 5) OR ( NOT( (c1 < 10) AND (c3+b3 > 10 ) ) ) > ); > > > > > > In the above, for example, I am not sure how to integrate base > expression > > > (c2 == 5) into the new LogicalExpressionPlan. There is no routine to > set the > > > plan for a given operator and its children. Also, there is currently > no way > > > to deepCopy an expression into a new OperatorPlan. It would be great > if you > > > could give me some suggestions on what approach to take for this. > > > > > > One approach I thought of is to visit the base expression and create > and > > > connect the base expression to the LogicalExpressionPlan as I visit > it. > > > > > > Thoughts? > > > Swati > > > > > > ps: About your other point regarding binary vs multi way trees, the > way I > > > am creating the normal form is a list of conjuncts, where each > conjunct is a > > > list of disjuncts. This is logically similar to a multi waytree. > However, > > > the current modeling of boolean expressions (modeled as binary > expressions) > > > requires a conversion back to the binary tree model when adding back > to the > > > main plan. > > > > > > On Tue, Jul 6, 2010 at 12:46 PM, Yan Zhou <y...@yahoo-inc.com> wrote: > > > > > > Swati, > > > > > > I happen to be working on the logical expression simplification effort > > > (https://issues.apache.org/jira/browse/PIG-1399), but not on the > filter > > > split front. So I guess our interests will have some overlaps. > > > > > > I think the filter logic split problem can be divided into 2 parts: > > > 1) the filtering logic that can be applied to individual input > sources; > > > and 2) the filtering logic that has to be applied when merged(or > joined) > > > inputs are processed. > > > > > > The benefits for 1) are any benefits if the underlying storage > supports > > > predicate pushdown; plus the memory/CPU savings by PIG for not > > > processing the unqualified rows. > > > > > > For 2), the purpose is not paying higher evaluation costs than > > > necessary. > > > > > > For 1), no normal form is necessary. The original logical expression > > > tree > > > can be trimmed off any sub-expressions that are not constants nor just > > > from a particular input source. The complexity is linear with the tree > > > size; while the use of normal form could potentially lead to > exponential > > > complexity. The difficulty with this approach is how to generate the > > > filtering logic for 2); while CNF can be used to easily figure out the > > > logic for 2). However, the exact logic in 2) might not be cheaper to > > > evaluate than the original logical expression. An example is "Filter > J2 > > > by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) > the > > > filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND > > > ((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost > > > will be 5 logical evaluations (3 OR plus 2 AND), which could be > reduced > > > to 4, compared with 3 logical evaluations in the original form. > > > > > > In summary, if only 1) is desired, the tree trimming is enough. If 2) > is > > > desired too, then CNF could be used but its complexity should be > > > controlled and the cost of the filtering logic evaluation in 2) should > > > be computed and compared with the original expression evaluation cost. > > > Further optimization is possible in this direction. > > > > > > Another potential optimization to consider is to support logical > > > expression tree of multiple children vs. the binary tree after taking > > > into consideration of the commutative property of OR and AND > operations. > > > The advantages are less tree traversal costs and easier to change the > > > evaluation ordering within the same sub-tree in order to maximize the > > > possibilities to short-cut the evaluations. Although this is general > for > > > all logical expressions, this tends to be more suitable for normal > form > > > handlings as normal forms group the sub-expressions by the operators > > > that act on the sub-expressions. > > > > > > Thanks, > > > > > > Yan > > > > > > > > > -----Original Message----- > > > From: Swati Jain [mailto:swat...@aggiemail.usu.edu] > > > Sent: Monday, July 05, 2010 2:34 AM > > > To: pig-dev@hadoop.apache.org > > > Cc: Daniel Dai > > > Subject: PIG Logical Optimization: Use CNF in SplitFilter > > > > > > Hi, > > > > > > I am interested in implementing logical optimization rules and to > target > > > this I have studied currently implemented logical rules and the rule > > > framework. In particular, I felt that rules dealing with LOfilter are > > > not > > > able to handle complicated boolean expressions. I would like to share > > > suggestions to improve handling of boolean expressions in LOFilter to > > > enable > > > better optimization. > > > > > > 1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into > > > two by > > > "AND". However it will not be able to split LOFilter if the top level > > > operator is "OR". For example: > > > > > > *ex script:* > > > A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int); > > > B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int); > > > C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int); > > > J1 = JOIN B by b1, C by c1; > > > J2 = JOIN J1 by $0, A by a1; > > > D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);* > > > explain D; > > > > > > In the above example current rule is not able to any filter condition > > > across > > > any join as it contains columns from all branches (inputs). But if we > > > convert this expression into "Conjunctive Normal Form" (CNF) then we > > > would > > > be able to push filter condition c1< 10 and c2 == 5 below both join > > > conditions. Here is the CNF expression for highlighted line: > > > > > > ( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) ) > > > > > > *Suggestions:* It would be a good idea to convert LOFilter's boolean > > > expression into CNF, it would then be easy to push parts (conjuncts) > of > > > the > > > LOFilter boolean expression selectively. > > > > > > I have started thinking about the design for implementing this > > > conversion > > > (arbitrary boolean expression to CNF) and would appreciate any > feedback > > > or > > > ideas. > > > > > > Thanks! > > > Swati > > > > > > > > > > > > > > > > >