Dandandan edited a comment on issue #2015:
URL:
https://github.com/apache/arrow-datafusion/issues/2015#issuecomment-1068016853
> **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
> I found that some filters were pushed down, but they still appear in the logical plan.
>
> ```sql
> explain select c1, c2 from test where c3 = true and c2 = 0.000001;
>
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> | plan_type     | plan                                                                                                                                |
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> | logical_plan  | Projection: #test.c1, #test.c2                                                                                                      |
> |               |   Filter: #test.c3                                                                                                                  |
> |               |     Filter: #test.c2 = Float64(0.000001)                                                                                            |
> |               |       TableScan: test projection=Some([0, 1, 2]), filters=[#test.c3, #test.c2 = Float64(0.000001)]                                  |
> | physical_plan | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                                                                                       |
> |               |   CoalesceBatchesExec: target_batch_size=4096                                                                                       |
> |               |     FilterExec: c3@2                                                                                                                |
> |               |       CoalesceBatchesExec: target_batch_size=4096                                                                                   |
> |               |         FilterExec: c2@1 = 0.000001                                                                                                 |
> |               |           RepartitionExec: partitioning=RoundRobinBatch(8)                                                                          |
> |               |             CsvExec: files=[/home/jakevin/code/arrow-datafusion/datafusion/tests/aggregate_simple.csv], has_header=true, limit=None |
> |               |                                                                                                                                     |
> +---------------+-------------------------------------------------------------------------------------------------------------------------------------+
> ```
>
> `Filter: #test.c2 = Float64(0.000001)` is included in the `TableScan`.
>
> Maybe we can eliminate it. What do you think? @alamb @Dandandan
>
> **Describe the solution you'd like**
> Correct the `push down filter` rule in the optimizer.
>
> **Describe alternatives you've considered**
>
> **Additional context**
> In PG
> ```sql
> create table scientist (id integer, firstname varchar(100), lastname varchar(100));
> insert into scientist (id, firstname, lastname) values (1, 'albert', 'einstein');
> insert into scientist (id, firstname, lastname) values (2, 'isaac', 'newton');
> insert into scientist (id, firstname, lastname) values (3, 'marie', 'curie');
> explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
> ```
>
> ```
> explain select id, firstname from scientist where lastname = 'einstein' and id > 2;
>                            QUERY PLAN
> ----------------------------------------------------------------
>  Seq Scan on scientist  (cost=0.00..12.55 rows=1 width=222)
>    Filter: ((id > 2) AND ((lastname)::text = 'einstein'::text))
> ```
Pushing the filter down to the table scan doesn't mean the filter can be removed from the plan. For example, Parquet data doesn't support exact filtering: based on the statistics in the file, the scan may return more rows than those matching the predicates, so a separate filter is still needed to remove the rows that don't match.
PostgreSQL scans data in a different way: it reads rows one by one and applies the filter directly to each row.
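To make the point above concrete, here is a minimal sketch (plain Python, not DataFusion or Parquet code) of statistics-based pruning: the scan can only skip whole groups whose min/max statistics rule out a match, so its output is a superset of the answer and a second, exact filter is still required. The row-group layout and values are made up for illustration.

```python
# Each "row group" carries min/max statistics, like a Parquet footer does.
row_groups = [
    {"min": 0.0, "max": 0.5, "rows": [0.1, 0.000001, 0.4]},
    {"min": 0.6, "max": 0.9, "rows": [0.7, 0.8]},
]

def scan_with_pruning(groups, predicate_value):
    """Skip groups whose statistics prove no row can match; the remaining
    groups are returned wholesale, so the output is a superset."""
    out = []
    for g in groups:
        if g["min"] <= predicate_value <= g["max"]:
            out.extend(g["rows"])  # may include rows that don't match
    return out

candidates = scan_with_pruning(row_groups, 0.000001)
exact = [r for r in candidates if r == 0.000001]  # the second, exact filter
```

Here the first group survives pruning, so the non-matching rows `0.1` and `0.4` come back from the scan and only the exact filter removes them.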
What can be done, I think, in terms of optimization is combining the two filters into one `FilterExec` (`FilterExec: c2@1 = 0.000001 AND c3@2`).
I guess that could in some cases be faster than having two filters.
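A hedged illustration of why (plain Python lists standing in for record batches, not DataFusion's actual `FilterExec`): two stacked filters make two passes and materialize an intermediate batch, while one conjunctive filter does the same work in a single pass.

```python
batch = [
    {"c2": 0.000001, "c3": True},
    {"c2": 0.5,      "c3": True},
    {"c2": 0.000001, "c3": False},
]

# Two filters: two passes over the data, plus an intermediate batch.
step1 = [row for row in batch if row["c3"]]
step2 = [row for row in step1 if row["c2"] == 0.000001]

# One combined filter: a single pass, no intermediate batch.
combined = [row for row in batch if row["c3"] and row["c2"] == 0.000001]

assert combined == step2
```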
Another optimization could be to apply the most selective filter expression first and/or the most expensive expression (such as a complex regex or case-when expression) last.
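A small sketch of that ordering idea (again plain Python, with a made-up dataset and a regex standing in for an expensive expression): with a short-circuiting `and`, putting the cheap, highly selective predicate first means the expensive one rarely runs.

```python
import re

rows = [("a" * 50, i) for i in range(1000)]
pattern = re.compile(r"a{50}")  # stand-in for an expensive expression

calls = 0
def expensive(s):
    global calls
    calls += 1  # count how often the expensive predicate is evaluated
    return bool(pattern.match(s))

# Cheap, selective predicate first: `i > 995` passes only 4 of 1000 rows,
# so the regex is evaluated 4 times instead of 1000.
result = [(s, i) for (s, i) in rows if i > 995 and expensive(s)]
```

With the predicates swapped, `expensive` would run 1000 times for the same result.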
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]