alamb opened a new issue, #21554: URL: https://github.com/apache/datafusion/issues/21554
### Is your feature request related to a problem or challenge?

In the parquet opener, DataFusion currently does per-file schema adaptation and pruning setup, including predicate rewrites and pruning predicate construction:

- https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L743-L788
- https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L1523-L1547

As @adriangb noted on https://github.com/apache/datafusion/pull/21480#issuecomment-4215673477, many deployments only have a small number of physical schemas, often just one, so repeating the same work across many files is wasteful.

PR #21480 from @fpetkovski improved this area by avoiding page pruning predicate construction unless page indexes are enabled, but there still seems to be a follow-on opportunity to cache equivalent pruning setup across files with the same physical schema.

### Describe the solution you'd like

Cache parquet pruning setup across files when the physical schema and other correctness-relevant inputs are the same.

This likely includes:

- expression/schema rewrite results
- pruning predicate construction
- page pruning predicate construction where applicable

### Describe alternatives you've considered

Continue with smaller local optimizations like PR #21480, or add more one-off fast paths. Those help, but caching shared setup seems like the more direct way to avoid repeated work.

### Additional context

Relevant links:

- Tracking comment from @adriangb: https://github.com/apache/datafusion/pull/21480#issuecomment-4215673477
- Original PR from @fpetkovski: https://github.com/apache/datafusion/pull/21480
- Page index loading / page pruning setup: https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L793-L839
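
For illustration only, the caching idea described under "Describe the solution you'd like" could be sketched roughly as below. This is a minimal, standard-library-only sketch and not DataFusion code: `PhysicalSchema`, `SetupKey`, `PruningSetup`, `PruningSetupCache`, `build_setup`, and `open_files_with_shared_schema` are all hypothetical names, the predicate is modeled as a plain string, and the "setup" is a placeholder for the real expression rewrite and pruning predicate construction. The point it tries to show is that the cache key has to include every correctness-relevant input (here assumed to be the physical schema, the predicate, and whether page indexes are enabled), so that files sharing those inputs reuse one setup.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a file's physical schema:
/// (column name, type name) pairs.
type PhysicalSchema = Vec<(String, String)>;

/// Hypothetical cache key: every input that can change the resulting setup.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct SetupKey {
    physical_schema: PhysicalSchema,
    predicate: String,
    page_index_enabled: bool,
}

/// Hypothetical result of the per-file setup work.
#[derive(Debug)]
struct PruningSetup {
    rewritten_predicate: String,
    has_page_pruning_predicate: bool,
}

/// Placeholder for the expensive work (expression/schema rewrite and
/// pruning predicate construction) that the issue proposes to share.
fn build_setup(key: &SetupKey) -> PruningSetup {
    PruningSetup {
        rewritten_predicate: format!(
            "{} /* adapted to {} columns */",
            key.predicate,
            key.physical_schema.len()
        ),
        // Assumption: a page pruning predicate is only built when page
        // indexes are enabled.
        has_page_pruning_predicate: key.page_index_enabled,
    }
}

/// Cache shared across file openers; counts how often setup is built.
#[derive(Default)]
struct PruningSetupCache {
    entries: Mutex<HashMap<SetupKey, Arc<PruningSetup>>>,
    builds: AtomicUsize,
}

impl PruningSetupCache {
    /// Returns the cached setup for `key`, building it on the first request.
    /// The lock is held while building, which keeps the sketch simple but
    /// serializes concurrent builds.
    fn get_or_build(&self, key: SetupKey) -> Arc<PruningSetup> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(&key) {
            return Arc::clone(hit);
        }
        self.builds.fetch_add(1, Ordering::Relaxed);
        let setup = Arc::new(build_setup(&key));
        entries.insert(key, Arc::clone(&setup));
        setup
    }

    /// Number of times the setup was actually built (cache misses).
    fn builds(&self) -> usize {
        self.builds.load(Ordering::Relaxed)
    }
}

/// Simulates opening `n` files that share one physical schema and returns
/// the total number of builds recorded by the cache afterwards.
fn open_files_with_shared_schema(cache: &PruningSetupCache, n: usize) -> usize {
    let schema: PhysicalSchema = vec![
        ("id".to_string(), "Int64".to_string()),
        ("ts".to_string(), "Timestamp".to_string()),
    ];
    for _ in 0..n {
        let key = SetupKey {
            physical_schema: schema.clone(),
            predicate: "id > 10".to_string(),
            page_index_enabled: false,
        };
        let _setup = cache.get_or_build(key);
    }
    cache.builds()
}
```

In this sketch, opening many files with one physical schema builds the setup once, while a file with a different schema (or different predicate or page index setting) gets its own entry. A real implementation would also need to decide on cache scope and eviction, which this sketch does not address.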
