avantgardnerio opened a new pull request, #5362:
URL: https://github.com/apache/arrow-datafusion/pull/5362
# Which issue does this PR close?
Closes #5357.
# Rationale for this change
If the planner/optimizer knows how a table is (or can be) sorted, it can
push more predicates down to the `TableProvider`.
For example, TPC-H query 9 might perform far better, since the table could
be naturally ordered on the primary key `(PS_PARTKEY, PS_SUPPKEY)`:
```
SELECT
    nation,
    o_year,
    SUM(amount) AS sum_profit
FROM
    (
        SELECT
            n_name AS nation,
            YEAR(o_orderdate) AS o_year,
            l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount
        FROM
            part,
            supplier,
            lineitem,
            partsupp,
            orders,
            nation
        WHERE
            s_suppkey = l_suppkey
            AND ps_suppkey = l_suppkey
            AND ps_partkey = l_partkey
```
After the subquery is decorrelated, the planner will try to join on the
primary key, so the resulting plan will likely look like:
```
EquiJoin(ps_suppkey = l_suppkey AND ps_partkey = l_partkey)
  Sort(PS_PARTKEY, PS_SUPPKEY)
    TableScan [filter=s_suppkey]
```
However, the filter could eliminate far more rows by using both key columns,
and the sort could be avoided entirely, since the table is already ordered
on that key.
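To illustrate why the ordering matters, here is a minimal sketch, independent of DataFusion, of a lookup against rows that are assumed to be stored sorted on `(ps_partkey, ps_suppkey)`. The `Row` type and both function names are hypothetical and exist only for this example. A predicate on both key columns can be answered with a binary search and no sort, while a predicate on `ps_suppkey` alone has to scan every row.

```rust
/// Hypothetical row layout: (ps_partkey, ps_suppkey, ps_supplycost in cents).
pub type Row = (u32, u32, u64);

/// Assumes `rows` is already sorted by (ps_partkey, ps_suppkey).
/// Finds all rows matching both key columns with two binary searches,
/// so no separate sort or full scan is needed.
pub fn lookup_both(rows: &[Row], partkey: u32, suppkey: u32) -> &[Row] {
    let key = (partkey, suppkey);
    // First index whose key is >= the requested key.
    let start = rows.partition_point(|r| (r.0, r.1) < key);
    // First index whose key is > the requested key.
    let end = rows.partition_point(|r| (r.0, r.1) <= key);
    &rows[start..end]
}

/// A filter on ps_suppkey alone is not a prefix of the sort key,
/// so it cannot use the ordering and must visit every row.
pub fn scan_by_suppkey(rows: &[Row], suppkey: u32) -> Vec<Row> {
    rows.iter().copied().filter(|r| r.1 == suppkey).collect()
}
```

A real `TableProvider` would translate pushed-down predicates into such a range lookup internally; the planner only needs to know that the ordering exists.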
# What changes are included in this PR?
An interface change to allow `TableProvider`s to inform the planner about
single or multi-column, primary or secondary indexes, so that a future
(fast-follow) PR can push predicates down to filter & sort in the
`TableProvider` automatically.
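As a rough sketch of what such an interface could look like (this is not the code in this PR): the trait below is a simplified stand-in for `TableProvider`, and `IndexKind`, `IndexInfo`, `indexes()`, and `covered_by_index()` are all hypothetical names. It assumes the new method gets a default implementation that reports no indexes, so existing implementations keep compiling unchanged.

```rust
// Hypothetical sketch: none of these names are taken from the PR or from DataFusion.

/// Whether an index is the table's primary ordering or a secondary one.
#[derive(Debug, Clone, PartialEq)]
pub enum IndexKind {
    Primary,
    Secondary,
}

/// Describes one single- or multi-column index.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexInfo {
    pub kind: IndexKind,
    /// Column names in index order, most significant first.
    pub columns: Vec<String>,
}

/// Simplified stand-in for the real `TableProvider` trait.
pub trait TableProvider {
    fn name(&self) -> &str;

    /// Default: no known indexes, so existing providers need no changes.
    fn indexes(&self) -> Vec<IndexInfo> {
        Vec::new()
    }
}

/// A provider that advertises a two-column primary index.
pub struct PartSupp;

impl TableProvider for PartSupp {
    fn name(&self) -> &str {
        "partsupp"
    }

    fn indexes(&self) -> Vec<IndexInfo> {
        vec![IndexInfo {
            kind: IndexKind::Primary,
            columns: vec!["ps_partkey".to_string(), "ps_suppkey".to_string()],
        }]
    }
}

/// A provider written before the change; it relies on the default.
pub struct LegacyTable;

impl TableProvider for LegacyTable {
    fn name(&self) -> &str {
        "legacy"
    }
}

/// Planner-side check: do `keys` form a non-empty prefix of some index?
pub fn covered_by_index(provider: &dyn TableProvider, keys: &[&str]) -> bool {
    !keys.is_empty()
        && provider.indexes().iter().any(|idx| {
            idx.columns.len() >= keys.len()
                && idx.columns.iter().zip(keys).all(|(c, k)| c.as_str() == *k)
        })
}
```

With something along these lines, a planner could ask whether the join keys `(ps_partkey, ps_suppkey)` are covered by an index before deciding to insert a `Sort`.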
# Are these changes tested?
No, it's an interface change, though maybe I could test that?
# Are there any user-facing changes?
No, existing `TableProvider`s should be fine.