Dandandan edited a comment on issue #2015:
URL:
https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068016853
> **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
> I found that some filters were pushed down, but they still appear in the logical plan.
>
> ```sql
> explain select c1, c2 from test where c3 = true and c2 = 0.000001;
>
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> | plan_type     | plan                                                                                                                                |
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
> |               |   Filter: #test.c3                                                                                                                  |
> |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
> |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
> | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
> |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
> |               |     FilterExec: c3@2                                                                                                                |
> |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
> |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
> |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
> |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
> |               |                                                                                                                                     |
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> ```
>
> `Filter: #test.c2 = Float64(0.000001)` is included in the `TableScan`.
>
> Maybe we can eliminate it. What do you think? @alamb @Dandandan
>
> **Describe the solution you'd like**
> Correct the `push down filter` rule in the optimizer.
>
> **Describe alternatives you've considered**
>
> **Additional context**
> In PG
> ```sql
> create table scientist (id integer, firstname varchar(100), lastname varchar(100));
> insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
> insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
> insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
> explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
> ```
>
> ```
> explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
>    Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
> ```
Pushing the filter down to the table scan doesn't mean the filter can be removed from the plan. For example, Parquet data doesn't support exact filtering: based on the statistics in the file, the scan may return more rows than those matching the predicates, so a separate filter is still needed to remove the rows that don't match.
PostgreSQL scans data in a different way: it reads rows one by one and applies the filter directly to each row.
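To make the point above concrete, here is a minimal sketch (plain Python, not DataFusion or Parquet code) of statistics-based pruning: the scan can only skip whole groups whose min/max statistics rule out a match, so its output is a superset of the answer and a second, exact filter is still required. The row-group layout and values are made up for illustration.

```python
# Each "row group" carries min/max statistics, like a Parquet footer does.
row_groups = [
    {"min": 0.0, "max": 0.5, "rows": [0.1, 0.000001, 0.4]},
    {"min": 0.6, "max": 0.9, "rows": [0.7, 0.8]},
]

def scan_with_pruning(groups, predicate_value):
    """Skip groups whose statistics prove no row can match; the remaining
    groups are returned wholesale, so the output is a superset."""
    out = []
    for g in groups:
        if g["min"] <= predicate_value <= g["max"]:
            out.extend(g["rows"])  # may include rows that don't match
    return out

candidates = scan_with_pruning(row_groups, 0.000001)
exact = [r for r in candidates if r == 0.000001]  # the second, exact filter
```

Here the first group survives pruning, so the non-matching rows `0.1` and `0.4` come back from the scan and only the exact filter removes them.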
What can be done, I think, in terms of optimization is combining the two filters into one `FilterExec` (`FilterExec: c2@1 = 0.000001 AND c3@2`).
I guess that could in some cases be faster than having two filters.
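A hedged illustration of why (plain Python lists standing in for record batches, not DataFusion's actual `FilterExec`): two stacked filters make two passes and materialize an intermediate batch, while one conjunctive filter does the same work in a single pass.

```python
batch = [
    {"c2": 0.000001, "c3": True},
    {"c2": 0.5,      "c3": True},
    {"c2": 0.000001, "c3": False},
]

# Two filters: two passes over the data, plus an intermediate batch.
step1 = [row for row in batch if row["c3"]]
step2 = [row for row in step1 if row["c2"] == 0.000001]

# One combined filter: a single pass, no intermediate batch.
combined = [row for row in batch if row["c3"] and row["c2"] == 0.000001]

assert combined == step2
```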
Another optimization could be to apply the most selective filter expression first and/or the most expensive expression (such as a complex regex or case-when expression) last.
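A small sketch of that ordering idea (again plain Python, with a made-up dataset and a regex standing in for an expensive expression): with a short-circuiting `and`, putting the cheap, highly selective predicate first means the expensive one rarely runs.

```python
import re

rows = [("a" * 50, i) for i in range(1000)]
pattern = re.compile(r"a{50}")  # stand-in for an expensive expression

calls = 0
def expensive(s):
    global calls
    calls += 1  # count how often the expensive predicate is evaluated
    return bool(pattern.match(s))

# Cheap, selective predicate first: `i > 995` passes only 4 of 1000 rows,
# so the regex is evaluated 4 times instead of 1000.
result = [(s, i) for (s, i) in rows if i > 995 and expensive(s)]
```

With the predicates swapped, `expensive` would run 1000 times for the same result.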
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]