[ 
https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887923#action_12887923
 ] 

Yan Zhou commented on PIG-1399:
-------------------------------

This work is not to optimize on the generic boolean logics, but rather to 
simplify the logic expression based upon the constant values as the logical 
expression's operands. The former, e.g.,  would change an boolean expression of 
((A AND B) OR (A AND C)) to (A AND (B OR C)); while the latter will change, 
say,  (a0 > 5 and a0 > 7) to (a0 > 7).  It is, therefore, up to the query 
writer to optimize his/her boolean logic, probably through use of some other 
tools.

The algorithm works in a series of steps in order:

1) a constant expression evaluation visitor that will evaluate the constant 
expressions. It works by traversing the expression tree in a bottom-up manner 
and evaluate all subexpressions that have all constant subexpressions. All 
results from constant children are pushed to a stack for the parent to digest 
for its own evaluation. Any non-constant expression will push a null to the 
stack and consequently will cause all of its ancestors not to be evaluated.
For simplicity, only constant binary expressions and constant unary expressions 
and evaluated. More complex expressions are not evaluated at this moment. For 
UDF, this evaluation is not planned at all for fear of possible stateful 
consequences resulting from the original evaluations;

2) A NOT conversion visitor that will traverse the expression tree in a 
depth-first manner with post-order handling. A status of negativity for a NOT 
expression is recorded in the depth-first traversal before subtree traversal 
and reversed after traversing the subtree. All "reversible" expressions is 
replaced by its negated counterpart for negative negativity. Currently equality 
ops, and its non-equality couter part, all range comparisons, logical AND and 
OR are reversible.
   Notably missing is the "is null" for lack of a "is not null" base expression;

3) A DNF plan is generated, through a helper DNFPlanGenerator visitor class, 
whose disjunctions are either of OrExpression or a new DNFExpression with type 
of "OR", whose conjunctions are either of AndExpression or the new 
DNFExpression with type of "AND". 
   The introduction of the new DNFExpression, which extends LogiclExpression, 
is to support multiple children (vs. the two children in a BinaryExpression) to 
facilitate the processing of multiple children
   of an "OR" or "AND" operator due to the commutative property of the two 
operators. The leaves of the DNF are of a new LogiclExpressionProxy type that 
extends the LogicalExpression.
   This new type is to be used as a proxy toward the original leaf expression 
in the original filter plan. The purpose is to track how often an original 
expression has been put in 
   the DNF plan as result of the normalizing process. Consequently, a 
DNFSplitCounter member is added to the LogicalExpression, which is incremented 
once a new proxy is created
   on the original expression. Due to the potentially exponential growth of the 
DNF plan, and the nonlinear complexity to trim the DNF plan (see 4 below), the 
size of the DNF plan is limited to 100 nodes beyind which the simplification
   beyond step 2) are just skipped;

4) Then the DNF plan is trimmed according to the inferrence rules between the 
operands of the conjunctions first, and then between the operands of the 
disjuction in the DNF plan.
   If a leaf is trimmed, the counter, DNFSpliCounter, of the source of the 
proxy will be decremented. Basically, the DNF plan is used as a utility to 
determine if an original leaf
   expression can be trimmed from the original filter plan or not. If all 
proxies of the original leaf expression have been trimmed from the DNF plan, 
the original leaf expression can be trimmed from the original plan then.
   The point is that the DNF plan is not intended to replace the original filer 
plan since the DNF plan in general tends to be more expensive to evaluate than 
the original filter plan.

5) The original filter plan is traversed in a bottom-up manner so that if a 
leaf's DNFSpliCounter is zero, which means all of its proxies on DNF has been 
trimmed, the leaf will be trimmed.
   For nonleafs of "AND" or "OR" expressions, if one child survives, the child 
will be relinked to the predecessor(s). If either or both children are trimmed, 
the nonleaf will be trimmed
   too. If the whole new filter plan is empty, the filter operator will be 
removed from the logical plan too.

Using a example of "B = filter A by NOT((a0 > 1) or (a1 < 3 and a0 >3+2))":

After 1), the filter plan becomes "NOT((a0 > 1) or (a1 < 3 and a0 >5))";

After 2), the filter plan becomes "(a0 <= 1) AND ((a1 >= 3) OR (a0 <= 5))";

After 3), the DNF plan is "((a0 <= 1) AND (a1 >= 3)) OR ((a0 <= 1) AND (a0 <= 
5))";

After 4), the DNF plan becomes "a0 <= 1";

After 5), the filter plan becomes "a0 <= 1".


> Logical Optimizer: Expression optimizor rule
> --------------------------------------------
>
>                 Key: PIG-1399
>                 URL: https://issues.apache.org/jira/browse/PIG-1399
>             Project: Pig
>          Issue Type: Sub-task
>          Components: impl
>    Affects Versions: 0.7.0
>            Reporter: Daniel Dai
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We can optimize expression in several ways:
> 1. Constant pre-calculation
> Example:
> B = filter A by a0 > 5+7;
> => B = filter A by a0 > 12;
> 2. Boolean expression optimization
> Example:
> B = filter A by not (not(a0>5) or a>10);
> => B = filter A by a0>5 and a<=10;

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to