Paul Rogers created DRILL-7558:
----------------------------------
Summary: Generalize filter push-down planner phase
Key: DRILL-7558
URL: https://issues.apache.org/jira/browse/DRILL-7558
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.18.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Fix For: 1.18.0
DRILL-7458 provides a base framework for storage plugins, including a
simplified filter push-down mechanism. [~volodymyr] notes that it may be *too*
simple:
{quote}
What about the case where this rule was applied for one filter, but the planner
at some point pushed another filter above the scan? For example, if we have
such a case:
{code}
Filter(a=2)
  Join(t1.b=t2.b, type=inner)
    Filter(b=3)
      Scan(t1)
    Scan(t2)
{code}
Filter b=3 will be pushed into the scan; the planner will then push filter a=2
below the join, above the scan:
{code}
Join(t1.b=t2.b, type=inner)
  Filter(a=2)
    Scan(t1, b=3)
  Scan(t2)
{code}
In this case, checking whether a filter was already pushed is not enough.
{quote}
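
The concern in the quote can be sketched in a few lines. This is a minimal illustration only: `ScanSpec` and `withFilter` are hypothetical names, not Drill or Calcite classes. The idea is that the scan accumulates pushed predicates instead of recording a single "filter was pushed" flag, so a filter that arrives after a later rewrite can still be merged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical scan description; not part of Drill or Calcite. */
class ScanSpec {
    private final String table;
    private final List<String> pushedFilters;

    ScanSpec(String table) {
        this(table, Collections.<String>emptyList());
    }

    private ScanSpec(String table, List<String> pushedFilters) {
        this.table = table;
        this.pushedFilters = pushedFilters;
    }

    /** Returns a new scan with the predicate ANDed onto those already pushed. */
    ScanSpec withFilter(String predicate) {
        List<String> merged = new ArrayList<>(pushedFilters);
        if (!merged.contains(predicate)) {
            merged.add(predicate);
        }
        return new ScanSpec(table, Collections.unmodifiableList(merged));
    }

    List<String> pushedFilters() {
        return pushedFilters;
    }

    @Override
    public String toString() {
        return pushedFilters.isEmpty()
            ? "Scan(" + table + ")"
            : "Scan(" + table + ", " + String.join(" AND ", pushedFilters) + ")";
    }

    public static void main(String[] args) {
        // First push-down: Filter(b=3) is absorbed into the scan.
        ScanSpec scan = new ScanSpec("t1").withFilter("b=3");
        // A later planner rewrite places Filter(a=2) above the same scan.
        scan = scan.withFilter("a=2");
        System.out.println(scan);
    }
}
```

A rule built on a boolean flag would stop after the first call; here the second call still succeeds and the scan carries both predicates.
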
Drill divides planning into a number of *phases*, each defined by a set of
*rules*. Most storage plugins perform filter push-down during the physical
planning stage. However, by this point, Drill has already decided on the degree
of parallelism: it is too late to use filter push-down to set the degree of
parallelism. Yet with a source such as a REST API, we want to use filters to
help us shard the query (that is, to set the degree of parallelism).
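
As a rough illustration of why the filter must be known before parallelism is chosen, consider a source queried by time range. The sketch below assumes a hypothetical REST source that accepts a half-open range `[start, end)`; `TimeRangeSharder` and `shard` are made-up names, not Drill API. Once a pushed filter bounds the range, the range can be split into one sub-range per minor fragment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Drill. */
class TimeRangeSharder {

    /** Splits [start, end) into at most maxShards contiguous sub-ranges, each {from, to}. */
    static List<long[]> shard(long start, long end, int maxShards) {
        if (end <= start || maxShards < 1) {
            throw new IllegalArgumentException("need start < end and maxShards >= 1");
        }
        long span = end - start;
        // Never create more shards than there are distinct values in the range.
        int shards = (int) Math.min(maxShards, span);
        List<long[]> ranges = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            // Sketch only: span * i may overflow for extremely large ranges.
            long from = start + span * i / shards;
            long to = start + span * (i + 1) / shards;
            ranges.add(new long[] {from, to});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Pushed filter: ts >= 0 AND ts < 100, with up to 4 minor fragments.
        for (long[] r : shard(0, 100, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Without the filter, the bounds are unknown and the scan has nothing to split, which is why push-down at physical planning time comes too late for this purpose.
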
DRILL-7458 performs filter push-down at *logical* planning time to work around
the above limitation. (In Drill, there are three different phases that could be
considered the logical phase, depending on which planning options are set to
control Calcite.)
[~volodymyr] points out that the logical planning phase may be the wrong place,
because the planner performs rewrites of the type he cited during that phase.
Thus, we need to research where to insert filter push-down. It must come:
* After rewrites of the kind described above.
* After join equivalence computations. (See DRILL-7556.)
* Before the decision is made about the number of minor fragments.
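
The three ordering constraints above can be written as a small check. The phase names below are hypothetical placeholders chosen for this sketch; they are not the names of Drill's actual planner phases.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical ordering check; not part of Drill. */
class PhaseOrderCheck {

    /** Placeholder phase names for illustration only. */
    enum Phase { FILTER_REWRITE, JOIN_EQUIVALENCE, FILTER_PUSH_DOWN, PARALLELIZATION }

    /** True if push-down follows both rewrite phases and precedes parallelization. */
    static boolean isValidOrder(List<Phase> order) {
        // Every phase must be present for the comparison to be meaningful.
        if (!order.containsAll(Arrays.asList(Phase.values()))) {
            return false;
        }
        int push = order.indexOf(Phase.FILTER_PUSH_DOWN);
        return order.indexOf(Phase.FILTER_REWRITE) < push
            && order.indexOf(Phase.JOIN_EQUIVALENCE) < push
            && push < order.indexOf(Phase.PARALLELIZATION);
    }

    public static void main(String[] args) {
        List<Phase> proposed = Arrays.asList(
            Phase.FILTER_REWRITE,
            Phase.JOIN_EQUIVALENCE,
            Phase.FILTER_PUSH_DOWN,
            Phase.PARALLELIZATION);
        System.out.println(isValidOrder(proposed));
    }
}
```
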
The goal of this ticket is to either:
* Research to identify an existing phase which satisfies these requirements, or
* Create a new phase.
Due to the way Calcite works, it is not a good idea to have a single phase
handle two tasks that depend on one another. That is, we cannot combine filter
push-down with the phase that defines the filters, nor can we add filter
push-down to the phase that chooses parallelism.
Background: Calcite is a rule-based query planner inspired by
[Volcano|https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf].
The above issue is a flaw with rule-based planners and was identified as early
as the [Cascades query framework
paper|https://www.csd.uoc.gr/~hy460/pdf/CascadesFrameworkForQueryOptimization.pdf],
which was the follow-up to Volcano.