Dandandan commented on issue #20324: URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3905133276
I think you're not 100% following my point, but I'm not sure:

* I believe that for TPCH / TPCDS (looking locally) the tables are generated based on the number of CPU cores, so they will be split into one partition per core during the scan and will mostly be opened around the same time at "start of scan" in different threads. So `open` will be called for all files around the same time.
* The current `RecordBatchReader` will _always fully evaluate_ each `RowFilter` (for the entire file/partition) before continuing, with the selection based on earlier columns. So if we add a filter, it will always be decoding / evaluating the entire column before continuing to the next column, which potentially wastes a lot of time if there might be more effective filters.
* Because a disabled filter now always returns "true", it still scans the column while no longer contributing to making the selection smaller.

With the current adaptiveness we could minimize the cost of evaluating the filter, but not remove the cost of decoding, during the scan, the columns passed in the `RowFilter`.
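To make the decode-versus-evaluate distinction concrete, here is a rough toy cost model in plain Rust. It is not the arrow-rs or DataFusion API: `ToyPredicate`, `ScanCost`, `simulate`, and the per-mille selectivities are hypothetical names and numbers chosen only for illustration. The assumption is that each predicate's column is decoded for every row still selected, and that a disabled predicate skips evaluation but still decodes and leaves the selection unchanged.

```rust
/// Toy stand-in for one predicate in a row filter (hypothetical, not arrow-rs).
#[derive(Clone, Copy, Debug)]
pub struct ToyPredicate {
    /// Rows kept per 1000 input rows when the predicate is evaluated.
    pub keep_per_mille: u64,
    /// A disabled predicate always returns "true" for every row.
    pub enabled: bool,
}

/// Row counts accumulated over all predicates of one simulated scan.
#[derive(Debug, PartialEq, Eq)]
pub struct ScanCost {
    pub rows_decoded: u64,
    pub rows_evaluated: u64,
    pub rows_selected: u64,
}

/// Applies the predicates in order. Each predicate sees only the rows that
/// survived the earlier ones, and its column is decoded for all of those rows.
pub fn simulate(total_rows: u64, predicates: &[ToyPredicate]) -> ScanCost {
    let mut selected = total_rows;
    let mut rows_decoded = 0;
    let mut rows_evaluated = 0;
    for p in predicates {
        // Assumption: decoding happens whether or not the predicate is enabled.
        rows_decoded += selected;
        if p.enabled {
            rows_evaluated += selected;
            selected = selected * p.keep_per_mille / 1000;
        }
        // Disabled: nothing is evaluated and the selection does not shrink.
    }
    ScanCost { rows_decoded, rows_evaluated, rows_selected: selected }
}
```

Under these assumptions, with 1,000,000 rows, a first predicate keeping 50% and a second keeping 10%, disabling the first one lowers `rows_evaluated` from 1,500,000 to 1,000,000 but raises `rows_decoded` from 1,500,000 to 2,000,000, because the second column is now decoded for every row. Dropping the first predicate from the filter entirely brings `rows_decoded` down to 1,000,000. In both of the latter cases the scan emits more rows, and the skipped predicate would have to be applied after the scan.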
