Re: [I] [EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) [datafusion]

via GitHub Sat, 14 Feb 2026 15:05:16 -0800


adriangb commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3902753597


   > I think [#19639](https://github.com/apache/datafusion/pull/19639) is 
somewhat similar in goal to 
[apache/arrow-rs#9414](https://github.com/apache/arrow-rs/pull/9414) or do we 
expect a large difference?
   
   I haven't looked at your PR yet (very exciting!) but I think the approaches 
are indeed similar.
   I thought about it for a bit and (pending unknowns to me) the tradeoffs are 
that the arrow version can be dynamic within a file scan (useful for small 
queries?) while the DataFusion version can be dynamic between files or even 
between queries. The DataFusion version may also make it easier to use the same 
code for multiple file formats and doesn't require any API changes or 
cross-crate synchronization. We could also do a version that tracks stats across 
files / queries in DataFusion and feeds that into the arrow machinery that then 
applies it.
   
   Regarding the dynamic filter specific selectivity tracking: the difference 
with these is that we can completely disable / discard them if needed without 
losing correctness. It might be possible to merge the efforts if we e.g. add 
`PhysicalExpr::is_discardable_filter() -> bool` or something, then the more 
general adaptive selectivity machinery can choose to discard the filter instead 
of just putting it last or something. I wonder if optimizer rules could derive 
"discardable" filters from more complex filters e.g. `col in ('large string', 
'another')` -> `AdaptiveSelectivity [ col >= 'anot' and col < 'larh' ] and col 
in ('large string', 'another')`.
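
   To make the IN-list idea concrete, here is a rough sketch of how such a 
cheap range pre-filter could be derived by truncating the smallest and largest 
values to a short prefix. `PrefixRange` and `derive_prefix_range` are 
hypothetical names for illustration, not existing DataFusion or arrow-rs APIs, 
and the sketch assumes ASCII values compared bytewise. Because the range is 
only a superset of the IN list, dropping it never changes results, which is 
what would make it "discardable".

```rust
/// Half-open range `[lower, upper)` covering every value of an IN list.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct PrefixRange {
    lower: String, // inclusive bound
    upper: String, // exclusive bound
}

impl PrefixRange {
    /// Cheap pre-check: a row outside the range cannot match the IN list.
    fn contains(&self, value: &str) -> bool {
        self.lower.as_str() <= value && value < self.upper.as_str()
    }
}

/// Derive range bounds from IN-list values by truncating to `prefix_len`
/// bytes. Assumes ASCII input; returns `None` when no safe bounds exist.
fn derive_prefix_range(values: &[&str], prefix_len: usize) -> Option<PrefixRange> {
    if prefix_len == 0 || values.iter().any(|v| !v.is_ascii()) {
        return None;
    }
    let smallest: &str = values.iter().copied().min()?;
    let largest: &str = values.iter().copied().max()?;

    // A prefix never sorts after the full string, so it is a valid
    // inclusive lower bound.
    let lower = smallest[..smallest.len().min(prefix_len)].to_string();

    // Bump the last byte of the largest value's prefix to get an exclusive
    // upper bound that sorts after every string sharing that prefix.
    let mut upper = largest.as_bytes()[..largest.len().min(prefix_len)].to_vec();
    let last = upper.last_mut()?;
    if *last >= 0x7f {
        return None; // bumping would leave the ASCII range
    }
    *last += 1;

    Some(PrefixRange {
        lower,
        upper: String::from_utf8(upper).ok()?,
    })
}
```

   With a prefix length of 4, the values `'large string'` and `'another'` give 
the bounds `col >= 'anot' and col < 'larh'`.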
   
   Some other general thoughts:
   - we should take into account compute needed for calculating the filters. So 
I think units of the thresholds should be bytes/second (bytes = amount of data 
discarded, seconds = time taken to compute the filter mask).
   - We can derive statistical correlations between the filters to do things 
like group correlated filters together to avoid creating useless masks, e.g. 
`name = 'adrian' and department = 'engineering' and role ilike '%database%'` 
could be grouped into `name = 'adrian'` (first, very selective) and then 
`department = 'engineering' and role ilike '%database%'` (as a single filter 
because they are correlated).
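
   A minimal sketch of both ideas, under stated assumptions: filters are 
scored in bytes discarded per second of mask computation, discardable filters 
below a threshold are dropped, and the correlation between two filter masks is 
estimated with a phi coefficient. `FilterStats`, `plan_filters` and 
`mask_correlation` are hypothetical names, not DataFusion APIs, and the phi 
coefficient is just one possible correlation measure.

```rust
use std::time::Duration;

/// Observed cost and benefit of one pushed-down filter (hypothetical type).
#[derive(Debug, Clone)]
struct FilterStats {
    name: &'static str,
    bytes_discarded: u64,
    eval_time: Duration,
    /// True if the filter can be dropped without affecting correctness.
    discardable: bool,
}

impl FilterStats {
    /// Bytes of data discarded per second spent computing the filter mask.
    fn bytes_per_second(&self) -> f64 {
        if self.bytes_discarded == 0 {
            return 0.0;
        }
        let secs = self.eval_time.as_secs_f64();
        if secs == 0.0 {
            f64::INFINITY
        } else {
            self.bytes_discarded as f64 / secs
        }
    }
}

/// Drop discardable filters below the threshold, then order the rest so the
/// most effective filter runs first.
fn plan_filters(mut filters: Vec<FilterStats>, min_bytes_per_sec: f64) -> Vec<FilterStats> {
    filters.retain(|f| !f.discardable || f.bytes_per_second() >= min_bytes_per_sec);
    filters.sort_by(|a, b| b.bytes_per_second().total_cmp(&a.bytes_per_second()));
    filters
}

/// Phi coefficient between two boolean filter masks over the same rows.
/// Values near 1.0 suggest the filters could be evaluated as one group.
/// Returns `None` for mismatched lengths or a constant mask.
fn mask_correlation(a: &[bool], b: &[bool]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut n11, mut n10, mut n01, mut n00) = (0f64, 0f64, 0f64, 0f64);
    for (&x, &y) in a.iter().zip(b) {
        match (x, y) {
            (true, true) => n11 += 1.0,
            (true, false) => n10 += 1.0,
            (false, true) => n01 += 1.0,
            (false, false) => n00 += 1.0,
        }
    }
    let denom = ((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)).sqrt();
    if denom == 0.0 {
        None
    } else {
        Some((n11 * n00 - n10 * n01) / denom)
    }
}
```

   Non-discardable filters are always kept here and only reordered, since 
dropping them would change results.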


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

