[ https://issues.apache.org/jira/browse/PIG-1399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12887923#action_12887923 ]
Yan Zhou commented on PIG-1399: ------------------------------- This work is not to optimize on the generic boolean logics, but rather to simplify the logic expression based upon the constant values as the logical expression's operands. The former, e.g., would change an boolean expression of ((A AND B) OR (A AND C)) to (A AND (B OR C)); while the latter will change, say, (a0 > 5 and a0 > 7) to (a0 > 7). It is, therefore, up to the query writer to optimize his/her boolean logic, probably through use of some other tools. The algorithm works in a series of steps in order: 1) a constant expression evaluation visitor that will evaluate the constant expressions. It works by traversing the expression tree in a bottom-up manner and evaluate all subexpressions that have all constant subexpressions. All results from constant children are pushed to a stack for the parent to digest for its own evaluation. Any non-constant expression will push a null to the stack and consequently will cause all of its ancestors not to be evaluated. For simplicity, only constant binary expressions and constant unary expressions and evaluated. More complex expressions are not evaluated at this moment. For UDF, this evaluation is not planned at all for fear of possible stateful consequences resulting from the original evaluations; 2) A NOT conversion visitor that will traverse the expression tree in a depth-first manner with post-order handling. A status of negativity for a NOT expression is recorded in the depth-first traversal before subtree traversal and reversed after traversing the subtree. All "reversible" expressions is replaced by its negated counterpart for negative negativity. Currently equality ops, and its non-equality couter part, all range comparisons, logical AND and OR are reversible. Notably missing is the "is null" for lack of a "is not null" base expression; 3) A DNF plan is generated, through a helper DNFPlanGenerator visitor class, whose disjunctions are either of OrExpression or a new DNFExpression with type of "OR", whose conjunctions are either of AndExpression or the new DNFExpression with type of "AND". The introduction of the new DNFExpression, which extends LogiclExpression, is to support multiple children (vs. the two children in a BinaryExpression) to facilitate the processing of multiple children of an "OR" or "AND" operator due to the commutative property of the two operators. The leaves of the DNF are of a new LogiclExpressionProxy type that extends the LogicalExpression. This new type is to be used as a proxy toward the original leaf expression in the original filter plan. The purpose is to track how often an original expression has been put in the DNF plan as result of the normalizing process. Consequently, a DNFSplitCounter member is added to the LogicalExpression, which is incremented once a new proxy is created on the original expression. Due to the potentially exponential growth of the DNF plan, and the nonlinear complexity to trim the DNF plan (see 4 below), the size of the DNF plan is limited to 100 nodes beyind which the simplification beyond step 2) are just skipped; 4) Then the DNF plan is trimmed according to the inferrence rules between the operands of the conjunctions first, and then between the operands of the disjuction in the DNF plan. If a leaf is trimmed, the counter, DNFSpliCounter, of the source of the proxy will be decremented. Basically, the DNF plan is used as a utility to determine if an original leaf expression can be trimmed from the original filter plan or not. If all proxies of the original leaf expression have been trimmed from the DNF plan, the original leaf expression can be trimmed from the original plan then. The point is that the DNF plan is not intended to replace the original filer plan since the DNF plan in general tends to be more expensive to evaluate than the original filter plan. 5) The original filter plan is traversed in a bottom-up manner so that if a leaf's DNFSpliCounter is zero, which means all of its proxies on DNF has been trimmed, the leaf will be trimmed. For nonleafs of "AND" or "OR" expressions, if one child survives, the child will be relinked to the predecessor(s). If either or both children are trimmed, the nonleaf will be trimmed too. If the whole new filter plan is empty, the filter operator will be removed from the logical plan too. Using a example of "B = filter A by NOT((a0 > 1) or (a1 < 3 and a0 >3+2))": After 1), the filter plan becomes "NOT((a0 > 1) or (a1 < 3 and a0 >5))"; After 2), the filter plan becomes "(a0 <= 1) AND ((a1 >= 3) OR (a0 <= 5))"; After 3), the DNF plan is "((a0 <= 1) AND (a1 >= 3)) OR ((a0 <= 1) AND (a0 <= 5))"; After 4), the DNF plan becomes "a0 <= 1"; After 5), the filter plan becomes "a0 <= 1". > Logical Optimizer: Expression optimizor rule > -------------------------------------------- > > Key: PIG-1399 > URL: https://issues.apache.org/jira/browse/PIG-1399 > Project: Pig > Issue Type: Sub-task > Components: impl > Affects Versions: 0.7.0 > Reporter: Daniel Dai > Assignee: Yan Zhou > Fix For: 0.8.0 > > > We can optimize expression in several ways: > 1. Constant pre-calculation > Example: > B = filter A by a0 > 5+7; > => B = filter A by a0 > 12; > 2. Boolean expression optimization > Example: > B = filter A by not (not(a0>5) or a>10); > => B = filter A by a0>5 and a<=10; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.