Paul Rogers created IMPALA-7831:
-----------------------------------
Summary: Revisit expression rewriting integration with planner
Key: IMPALA-7831
URL: https://issues.apache.org/jira/browse/IMPALA-7831
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 3.0
Reporter: Paul Rogers
Assignee: Paul Rogers
The planner performs expression rewriting. It appears that the rewrite engine
was added late in planner development, as an add-on step in {{AnalysisContext}}
after we create the plan. Since that time, it appears that a number of fixes
and patches have been applied to work around the inevitable bugs that resulted
from this placement of the logic.
At present, the planner flow, with rewrites, is:
* Analyze the entire query
* Assign WHERE clause "conjuncts" to scan nodes, etc.
* Create the full plan
* Rewrite the SELECT, WHERE, HAVING and GROUP BY clauses
* Throw away the plan created above and create a new one
This ticket proposes to adjust the flow to incorporate rewrites earlier in the
process, allowing the planner to make a single pass over the query. Doing so
should resolve a number of bugs described in associated tickets.
h4. Background
The above logic evolved because of a timing issue: once we assign conjuncts, we
have plan nodes that point to the original WHERE clause expressions. We later
rewrite these, but we do so by throwing away the original nodes, replacing them
with new ones. Since the scan and other nodes still have a pointer to the old
version, the rewrites can have no effect.
To work around this, the code throws away that original plan and replans using
the new, rewritten nodes.
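The stale-pointer problem can be illustrated with a minimal sketch. The classes
below ({{StaleConjunctDemo}}, {{Expr}}, {{ScanNode}}) are hypothetical stand-ins
invented for illustration, not Impala's actual frontend classes; the only
assumption carried over from the description above is that a rewrite produces a
new node rather than modifying the old one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not Impala's real classes.
final class StaleConjunctDemo {
  /** Immutable expression node, so a rewrite has to return a new object. */
  static final class Expr {
    final String sql;
    Expr(String sql) { this.sql = sql; }
  }

  /** A scan node that captured references to conjuncts when they were assigned. */
  static final class ScanNode {
    final List<Expr> conjuncts = new ArrayList<>();
  }

  /** Toy rewrite rule: builds a replacement node and leaves the original as-is. */
  static Expr rewrite(Expr e) {
    return new Expr(e.sql.replace("x BETWEEN 1 AND 10", "1 <= x AND x <= 10"));
  }

  public static void main(String[] args) {
    Expr where = new Expr("x BETWEEN 1 AND 10");
    ScanNode scan = new ScanNode();
    scan.conjuncts.add(where);        // conjunct assigned before the rewrite

    Expr rewritten = rewrite(where);  // the statement would now hold this node

    // The scan node still references the pre-rewrite expression.
    System.out.println("scan node sees: " + scan.conjuncts.get(0).sql);
    System.out.println("statement sees: " + rewritten.sql);
  }
}
```

In this sketch the only ways to make the scan node see the rewrite are to
rebuild it from the new expression (the current replan approach) or to rewrite
before conjuncts are assigned.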
This then creates an interesting issue. We do the full analysis (and plan)
because we need the column bindings in order to do the rewrite. Since
plan/analysis is implemented as a single black box, rewrites can't be done
before planning (no column bindings yet), so they must be done after (column
bindings are available, but so is the entire plan).
Some expression nodes have incomplete implementations. For example,
{{X BETWEEN Y AND Z}} does not compute a cost because it is a "virtual" node: it
does not exist at run time, having been rewritten to {{Y <= X AND X <= Z}}. This
means that not only do we throw away the first plan, but that first plan was
actually wrong: it used incomplete information.
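As a rough illustration of the {{BETWEEN}} rewrite mentioned above, the sketch
below replaces a "virtual" node with the conjunction that would actually
execute. All names here ({{BetweenRewriteSketch}}, {{Leaf}}, {{Binary}},
{{Between}}) are hypothetical and chosen for the example; Impala's real rewrite
rules and expression classes are structured differently.

```java
// Hypothetical expression tree for illustration; not Impala's real classes.
final class BetweenRewriteSketch {
  interface Expr { String toSql(); }

  /** A column reference or literal. */
  static final class Leaf implements Expr {
    final String text;
    Leaf(String text) { this.text = text; }
    public String toSql() { return text; }
  }

  /** A binary predicate such as "a <= b" or "p AND q" (no parenthesization). */
  static final class Binary implements Expr {
    final String op;
    final Expr left;
    final Expr right;
    Binary(String op, Expr left, Expr right) {
      this.op = op;
      this.left = left;
      this.right = right;
    }
    public String toSql() { return left.toSql() + " " + op + " " + right.toSql(); }
  }

  /** The "virtual" node: present after parsing, absent from the executed plan. */
  static final class Between implements Expr {
    final Expr value;
    final Expr low;
    final Expr high;
    Between(Expr value, Expr low, Expr high) {
      this.value = value;
      this.low = low;
      this.high = high;
    }
    public String toSql() {
      return value.toSql() + " BETWEEN " + low.toSql() + " AND " + high.toSql();
    }
  }

  /** Recursively replaces X BETWEEN Y AND Z with Y <= X AND X <= Z. */
  static Expr rewrite(Expr e) {
    if (e instanceof Between) {
      Between b = (Between) e;
      Expr x = rewrite(b.value);
      Expr y = rewrite(b.low);
      Expr z = rewrite(b.high);
      return new Binary("AND", new Binary("<=", y, x), new Binary("<=", x, z));
    }
    if (e instanceof Binary) {
      Binary b = (Binary) e;
      return new Binary(b.op, rewrite(b.left), rewrite(b.right));
    }
    return e;  // leaves are returned unchanged
  }

  public static void main(String[] args) {
    Expr original = new Between(new Leaf("x"), new Leaf("1"), new Leaf("10"));
    System.out.println(original.toSql());
    System.out.println(rewrite(original).toSql());
  }
}
```

Any cost estimate taken before {{rewrite}} runs would be computed against the
{{Between}} node, which is the situation described above.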
Thus, in order to get the semantic info needed for rewrites (column bindings),
we end up creating an entire plan which we must then discard and rebuild after
doing the rewrites (so the planner has the full information.)
h4. Alternative
The alternative approach is to integrate expression rewrites into the planner
process, rather than doing them from the outside, so that we make only a single
pass through the planner. In particular:
* Analyze expressions to create column bindings.
* Match up SELECT and GROUP BY and other expressions (if required). GROUP BY
points to a SELECT clause node (so it will see rewrites) rather than to the
SELECT expression itself (which will be discarded by the rewrite).
* Rewrite SELECT and WHERE clause expressions. (Bound GROUP BY expressions
will see the rewrites.)
* Complete the plan as today.
With this approach, we plan only once, and that plan has a full set of cost
information based on the rewritten expressions which the BE will execute.
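One way to picture the GROUP BY binding in the steps above is a level of
indirection: GROUP BY holds a reference to a stable SELECT-list slot, and the
rewrite swaps the expression inside that slot. The sketch below is only an
assumption about how such a binding might look; {{SelectSlotSketch}},
{{SelectItem}} and {{GroupBy}} are hypothetical names, and expressions are
plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the indirection idea; not Impala's real classes.
final class SelectSlotSketch {
  /** Stable holder for one SELECT-list entry; the expression inside can be swapped. */
  static final class SelectItem {
    private String expr;
    SelectItem(String expr) { this.expr = expr; }
    String expr() { return expr; }
    /** Replaces the held expression with the result of applying the rule. */
    void rewrite(UnaryOperator<String> rule) { expr = rule.apply(expr); }
  }

  /** GROUP BY keys reference SELECT slots, not the expressions themselves. */
  static final class GroupBy {
    final List<SelectItem> keys = new ArrayList<>();
  }

  public static void main(String[] args) {
    SelectItem item = new SelectItem("a + 0");
    GroupBy groupBy = new GroupBy();
    groupBy.keys.add(item);                     // bind before rewriting

    item.rewrite(s -> s.replace(" + 0", ""));   // toy simplification rule

    // GROUP BY observes the rewritten expression through the shared slot.
    System.out.println(groupBy.keys.get(0).expr());
  }
}
```

Because the slot object survives the rewrite, nothing bound to it has to be
rebuilt afterward, which is what would let the planner run once.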
The purpose of this ticket is to track this analysis and to later propose a
detailed fix.