adriangb commented on PR #19639:
URL: https://github.com/apache/datafusion/pull/19639#issuecomment-3715509979

   Thanks so much @sdf-jkl, that's super useful info!
   
   The ClickHouse resources seem to be more in line with Parquet row group 
pruning using statistics, which happens before this process. What we are 
talking about here is how to apply the filters during the scan itself, 
which would come after the `PREWHERE` / row group stats step.
   
   One long-term vision for this is that we could "seed" the filter 
selectivities (instead of assuming they're all unknown). That's basically what 
you are proposing in `Before seeing your PR and comments in 
https://github.com/apache/datafusion/issues/3463 I was thinking about using 
more simple heuristics for sorting predicates.` We discussed that a bit in 
https://github.com/apache/datafusion/issues/3463#issuecomment-3708382916. 
TL;DR: yes, I think using column statistics, sizes, a global cache, etc. would 
be better than making no assumptions, as this PR currently does, but we can 
improve that later. My goal for now is that performance is roughly no worse 
than without filter pushdown when there are no selective filters, and that 
when there are selective filters we can take advantage of them.
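
   As a rough illustration of the "seeding" idea, here is a minimal sketch 
that orders filters from prior estimates. Everything in it is an assumption 
made for illustration: `FilterPrior`, `rank`, and `order_filters` are 
hypothetical names, not DataFusion APIs, and the ranking rule (bytes decoded 
per row divided by the fraction of rows rejected) is a common textbook 
heuristic, not what this PR implements. Filters with unknown selectivity are 
treated as passing every row, so they sort last.

```rust
/// Hypothetical prior for one pushed-down filter (not a DataFusion type).
#[derive(Debug, Clone, PartialEq)]
pub struct FilterPrior {
    pub name: &'static str,
    /// Estimated fraction of rows that pass, in [0, 1]; `None` means unknown.
    pub pass_fraction: Option<f64>,
    /// Estimated bytes decoded per row to evaluate the filter.
    pub bytes_per_row: f64,
}

/// Lower rank means "evaluate earlier": cheap filters that reject many rows win.
pub fn rank(p: &FilterPrior) -> f64 {
    // Unknown selectivity is treated as "passes everything" (no assumption made).
    let pass = p.pass_fraction.unwrap_or(1.0).clamp(0.0, 1.0);
    if pass >= 1.0 {
        // A filter that rejects nothing gives no benefit, so it goes last.
        return f64::INFINITY;
    }
    p.bytes_per_row / (1.0 - pass)
}

/// Stable sort by rank, so filters with equal rank keep their original order.
pub fn order_filters(mut priors: Vec<FilterPrior>) -> Vec<FilterPrior> {
    priors.sort_by(|a, b| rank(a).total_cmp(&rank(b)));
    priors
}
```

   With priors like these, a cheap and highly selective equality filter 
would run before a wide string filter whose selectivity is unknown, and 
observed selectivities could later replace the seeded values.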


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

