tustvold commented on issue #3463: URL: https://github.com/apache/datafusion/issues/3463#issuecomment-3708285486
It depends what you mean by IO 😅. If you mean fetching data from disk / network, you are correct: the predicate pushdown being discussed here (late materialization) does not influence IO. The only predicate pushdown that influences IO is using statistics to generate a RowSelection that filters out entire pages based on the page index. This is by design, as it allows for vectored IO / read coalescing, which is critical for decent performance on object stores.

Or to phrase it differently: DF enabling predicate pushdown will not influence the IO pattern to disk, and therefore this cannot be responsible for the regression in performance.

What https://github.com/apache/arrow-rs/pull/8733 does do is change the way the parquet reader actually decodes the fetched bytes, allowing it to effectively give up on trying to use a filter that isn't proving to be very selective. This improves the worst-case regression for pushing down a "bad" filter, although it is still not as cheap as not pushing the filter down at all.

It's also worth noting that the parquet reader doesn't really care about selectivity; what it cares about is how contiguous the filter is. If the filter only filters out 1% of the rows, but they're all consecutive, that is still a good filter to push down.

_This is based on knowledge of the parquet reader that may be a year out of date._
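
To illustrate the point about contiguity versus selectivity, here is a minimal sketch that run-length encodes a per-row keep/drop mask into select/skip runs and compares two filters that both drop 1% of the rows. The `Run` enum and the helpers `to_runs`, `contiguous_mask` and `scattered_mask` are hypothetical names made up for this example; they are not the arrow-rs API, and using the run count as a rough proxy for how "contiguous" a filter is is an assumption of the sketch, not a metric the reader is known to use.

```rust
/// Hypothetical stand-in for one run of a row selection. This is NOT the
/// arrow-rs type; it only mirrors the select/skip idea for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Run-length encode a per-row mask (true = keep the row) into runs.
fn to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        let keep = mask[i];
        let start = i;
        // Advance to the end of the current run of identical values.
        while i < mask.len() && mask[i] == keep {
            i += 1;
        }
        let len = i - start;
        runs.push(if keep { Run::Select(len) } else { Run::Skip(len) });
    }
    runs
}

/// Mask over `rows` rows that drops one block of `dropped` consecutive rows
/// beginning at row `start`.
fn contiguous_mask(rows: usize, start: usize, dropped: usize) -> Vec<bool> {
    (0..rows)
        .map(|i| !(i >= start && i < start + dropped))
        .collect()
}

/// Mask over `rows` rows that drops every row whose index is a multiple of
/// `stride`, so the dropped rows are spread evenly through the data.
fn scattered_mask(rows: usize, stride: usize) -> Vec<bool> {
    (0..rows).map(|i| i % stride != 0).collect()
}
```

With 1000 rows, `to_runs(&contiguous_mask(1000, 500, 10))` yields three runs (select 500, skip 10, select 490), whereas `to_runs(&scattered_mask(1000, 100))` yields twenty runs of alternating skip 1 / select 99. Both filters have the same selectivity, but the second one forces the decoder to switch between skipping and reading far more often.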
