Re: [I] [EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) [datafusion]

via GitHub Sun, 15 Feb 2026 12:08:52 -0800


adriangb commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3905097162


   > See [#20160 (comment)](https://github.com/apache/datafusion/pull/20160#issuecomment-3905053370)
   > 
   > I think approaches to adaptiveness/selectivity tracking also need to work
   > _during_ file scan - otherwise we're evaluating / reading the columns
   > without using the selectivity tracking very much.
   >
   > We probably need to integrate more with the parquet reader (i.e. implement
   > some config / hooks / in arrow-rs) so it knows it can stop decoding the
   > column that will yield only "true" for that predicate (as it's inactive)
   > and we can also update the predicate order based on new info, etc.
   
   I think that would be nice to have but not required. For small datasets it
will be hard to collect enough information to make a good decision about how to
evaluate filters. So unless the dataset consists of a few very large files, or
the parallelism is high enough that all of the files are opened at once, being
dynamic within a scan will bring only a marginal benefit over being dynamic
between scans. But maybe I’m wrong about the magnitude of the impact…
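
   To make the "not enough information" point concrete, here is a minimal
sketch of between-scan selectivity tracking. Everything in it is hypothetical
(`SelectivityTracker`, `FilterStats`, `min_rows`, `max_pass_fraction`); none of
it is an existing DataFusion or arrow-rs API. The tracker refuses to make a
decision until it has observed `min_rows` rows for a predicate, which is
roughly why a small dataset may finish before any decision is possible.

```rust
use std::collections::HashMap;

/// Hypothetical per-predicate counters, accumulated across files or scans.
#[derive(Debug, Default, Clone, Copy)]
struct FilterStats {
    rows_seen: u64,
    rows_passed: u64,
}

/// Hypothetical tracker for illustration only; not a DataFusion API.
#[derive(Debug)]
struct SelectivityTracker {
    stats: HashMap<String, FilterStats>,
    /// Minimum number of observed rows before any decision is made.
    min_rows: u64,
    /// Push a filter down only if at most this fraction of rows passes it.
    max_pass_fraction: f64,
}

impl SelectivityTracker {
    fn new(min_rows: u64, max_pass_fraction: f64) -> Self {
        Self {
            stats: HashMap::new(),
            min_rows,
            max_pass_fraction,
        }
    }

    /// Record the outcome of evaluating `predicate` over one batch or file.
    fn record(&mut self, predicate: &str, rows_seen: u64, rows_passed: u64) {
        let s = self.stats.entry(predicate.to_string()).or_default();
        s.rows_seen += rows_seen;
        s.rows_passed += rows_passed;
    }

    /// Fraction of rows that passed, or `None` while the sample is too small.
    fn pass_fraction(&self, predicate: &str) -> Option<f64> {
        let s = self.stats.get(predicate)?;
        if s.rows_seen == 0 || s.rows_seen < self.min_rows {
            return None;
        }
        Some(s.rows_passed as f64 / s.rows_seen as f64)
    }

    /// `Some(true)`: push down; `Some(false)`: do not; `None`: undecided.
    fn should_push_down(&self, predicate: &str) -> Option<bool> {
        self.pass_fraction(predicate)
            .map(|f| f <= self.max_pass_fraction)
    }
}
```

   With `min_rows = 1000`, a predicate that has only been evaluated on 100 rows
yields `None`, so the caller would have to fall back to some default strategy
until more files have been read.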


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


