adriangb commented on issue #20324:
URL: https://github.com/apache/datafusion/issues/20324#issuecomment-3905097162
> See [#20160 (comment)](https://github.com/apache/datafusion/pull/20160#issuecomment-3905053370)
>
> I think approaches to adaptiveness/selectivity tracking also need to work _during_ file scan - otherwise we're evaluating / reading the columns without using the selectivity tracking very much.
>
> We probably need to integrate more with the parquet reader (i.e. implement some config / hooks / in arrow-rs) so it knows it can stop decoding the column that will yield only "true" for that predicate (as it's inactive) and we can also update the predicate order based on new info, etc.

I think that would be nice to have but not required: for small datasets I think it will be hard to collect enough information to make a good decision on how to evaluate filters. So unless the dataset consists of a few very large files or the parallelism is super high (so all of the files are opened at once), being dynamic within a scan will bring only a marginal benefit over being dynamic between scans. But maybe I’m wrong about the magnitude of the impact…

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
