Re: [I] [EPIC] Fix performance regressions when enabling parquet filter pushdown (late materialization) [datafusion]

via GitHub Sun, 15 Feb 2026 12:32:23 -0800


Dandandan commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3905133276


   I don't think you're fully following my point, though I'm not sure:
   
   * I believe the TPCH / TPCDS tables (looking locally) are generated based 
on the number of CPU cores, so they will be split into as many partitions as 
there are cores during the scan and will mostly be opened around the same 
time, at the "start of scan", in different threads - so `open` will be called 
for all files around the same time.
   * The current `RecordBatchReader` will _always fully evaluate_ each 
`RowFilter` (for the entire file/partition) before continuing, with the 
selection based on earlier columns. So if we add a filter, it will always be 
decoding / evaluating the entire column before continuing to the next column, 
which potentially wastes a lot of time if there might be more effective 
filters.
   * Because a disabled filter now always returns "true", it still scans the 
column while no longer contributing to making the selection smaller.
   
   With the current adaptiveness we could minimize the cost of evaluating the 
filter, but not remove the cost of decoding, during the scan, the columns 
passed in the `RowFilter`.
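
   To make the last two points concrete, here is a toy model of predicates 
evaluated one after another over a whole file, each one seeing only the rows 
selected by the earlier ones. It is a sketch, not the arrow-rs or DataFusion 
API: `Predicate`, `evaluate` and `demo_predicates` are hypothetical names, and 
"decoded" simply counts the values a predicate has to look at.

```rust
/// Toy stand-in for one pushed-down predicate and the column it reads.
/// Hypothetical type; not the arrow-rs `RowFilter` / `ArrowPredicate` API.
struct Predicate {
    name: &'static str,
    column: Vec<i64>,
    keep: fn(i64) -> bool,
}

/// Evaluates the predicates in order over the whole "file".
/// Returns the final selection and, per predicate, how many values were decoded.
fn evaluate(predicates: &[Predicate], num_rows: usize) -> (Vec<bool>, Vec<usize>) {
    let mut selection = vec![true; num_rows];
    let mut decoded = Vec::with_capacity(predicates.len());
    for p in predicates {
        let mut count = 0;
        for row in 0..num_rows {
            if selection[row] {
                // The value must be decoded before the predicate can run on it.
                count += 1;
                selection[row] = (p.keep)(p.column[row]);
            }
        }
        decoded.push(count);
    }
    (selection, decoded)
}

/// A "disabled" predicate that always returns true, followed by a selective
/// one that keeps one row in ten.
fn demo_predicates(num_rows: usize) -> Vec<Predicate> {
    let a: Vec<i64> = (0..num_rows as i64).collect();
    let b: Vec<i64> = (0..num_rows as i64).map(|v| v % 10).collect();
    vec![
        Predicate { name: "disabled", column: a, keep: |_| true },
        Predicate { name: "selective", column: b, keep: |v| v == 0 },
    ]
}

fn main() {
    let n = 1000;
    let mut predicates = demo_predicates(n);

    // Disabled predicate first: its column is decoded in full, but the
    // selection handed to the next predicate is no smaller.
    let (_, decoded) = evaluate(&predicates, n);
    for (p, d) in predicates.iter().zip(&decoded) {
        println!("{}: decoded {} values", p.name, d); // 1000, then 1000
    }

    // Selective predicate first: the disabled one now decodes only 100 values,
    // but that decoding cost is still paid.
    predicates.reverse();
    let (selection, decoded) = evaluate(&predicates, n);
    let kept = selection.iter().filter(|&&s| s).count();
    println!("reordered: decoded {:?}, kept {} rows", decoded, kept);
}
```

   In this model, reordering shrinks the work done for the always-true 
predicate but never brings it to zero, which is the decoding cost described 
above. A real reader works per row group and page, so the numbers are only 
illustrative.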


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

