Dandandan commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3912826854

   I think a problem with the dynamic filters in queries like Q24 is that on 
start they are not selective at all (just `true`), yet they are pushed down on 
file open and can only be updated once rows have reached the TopK (and, as you 
say, the entire column will be evaluated during scanning before any batch is 
produced, by design).
   
   The files are scanned directly on start, so many of them probably just get an 
"always true" predicate pushed down to the parquet scan, and the `EventTime` 
column doesn't benefit from the selectivity of `"SearchPhrase" <> ''` because 
its filter probably comes first (as the column is smaller).
   
   > I don't think it's similar to what 
https://github.com/apache/datafusion/pull/19639 addresses - the filter here is 
actually selective? (14% passing through)
   
   Yes, correct; that only helps with deciding it after the fact.
   
   Some things we could do:
   * We should detect constant `true` filters and avoid pushing them down on 
open (or, more generally, any predicate we know is not effective). It 
makes no sense to push those down at the moment, as the predicate will not be 
updated during the scan. Perhaps this alone would already improve most of the 
queries?
   * I am wondering whether, for those queries, a conservative heuristic would 
also be to always put dynamic filters after the static filters (regardless of 
column size), so the overhead of pushing down bad dynamic filters won't be as 
bad. It might regress some good TopK predicates, though.
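
   A rough sketch of how the two ideas above could combine. This is not 
DataFusion code: `Predicate`, `plan_pushdown`, the `bound` field, and the 
`column_bytes` sizes are all hypothetical stand-ins, and a real implementation 
would work on physical expressions instead.

```rust
/// Hypothetical, simplified stand-in for a predicate pushed into a scan.
#[derive(Debug, Clone, PartialEq)]
enum Predicate {
    /// A filter known at plan time, e.g. "SearchPhrase" <> ''.
    Static { column: String, column_bytes: u64 },
    /// A TopK-style dynamic filter. `bound` is `None` while the filter is
    /// still the placeholder `true` (no threshold has been set yet).
    Dynamic { column: String, column_bytes: u64, bound: Option<i64> },
}

impl Predicate {
    /// True when the predicate cannot filter anything at file open.
    fn is_always_true(&self) -> bool {
        matches!(self, Predicate::Dynamic { bound: None, .. })
    }

    fn is_dynamic(&self) -> bool {
        matches!(self, Predicate::Dynamic { .. })
    }

    fn column(&self) -> &str {
        match self {
            Predicate::Static { column, .. } | Predicate::Dynamic { column, .. } => column,
        }
    }

    /// Assumed per-value column width, used only as an ordering tiebreaker.
    fn column_bytes(&self) -> u64 {
        match self {
            Predicate::Static { column_bytes, .. }
            | Predicate::Dynamic { column_bytes, .. } => *column_bytes,
        }
    }
}

/// Decide which predicates to push down at file open, and in what order.
fn plan_pushdown(predicates: &[Predicate]) -> Vec<Predicate> {
    // Idea 1: skip predicates that are known to be constant `true`.
    let mut kept: Vec<Predicate> = predicates
        .iter()
        .filter(|p| !p.is_always_true())
        .cloned()
        .collect();
    // Idea 2: static filters first (false sorts before true), regardless of
    // column size; column size only orders predicates within each group.
    kept.sort_by_key(|p| (p.is_dynamic(), p.column_bytes()));
    kept
}
```

   With an unset `EventTime` dynamic filter and a static `SearchPhrase` filter, 
`plan_pushdown` keeps only the static one; once the dynamic filter has a bound, 
it is kept but placed after the static filter even though its column is smaller.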


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

