Tamar-Posen commented on issue #19764: URL: https://github.com/apache/datafusion/issues/19764#issuecomment-3742533363
Yes, it's exactly about scale and metadata overhead. We operate on 100TB+ datasets with millions of Parquet files, where even reading file footers for standard pruning is costly. Our inverted index maps values directly to file references, avoiding metadata scans entirely. We often join a small predicate table with a massive base table. Previously, we used manual sideways information passing so our custom TableProvider could use the index during planning. With DynamicFilters, this logic correctly moves to execution, but we lose the ability to trigger index-based pruning before files are opened. The goal is to use runtime DynamicFilter values to query the inverted index and skip opening most files.
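To make the goal concrete, here is a minimal sketch of the lookup step in plain Rust. It is not DataFusion API: `InvertedIndex`, `files_for_values`, and `prune_files` are hypothetical names, the index is an in-memory map standing in for the real one, and the runtime filter is assumed to have already been reduced to a list of literal values (for example, the join keys from the small predicate table). It also assumes the index is complete, so a file with no entry for a value can safely be skipped.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical inverted index: column value -> files containing that value.
/// An illustrative stand-in, not a DataFusion type.
pub struct InvertedIndex {
    postings: HashMap<String, BTreeSet<String>>,
}

impl InvertedIndex {
    pub fn new() -> Self {
        Self { postings: HashMap::new() }
    }

    /// Record that `file` contains at least one row with `value`.
    pub fn insert(&mut self, value: &str, file: &str) {
        self.postings
            .entry(value.to_string())
            .or_default()
            .insert(file.to_string());
    }

    /// Union of the posting lists for the given values. No file footer is
    /// read here; the answer comes from the index alone.
    pub fn files_for_values<'a, I>(&self, values: I) -> BTreeSet<String>
    where
        I: IntoIterator<Item = &'a str>,
    {
        let mut files = BTreeSet::new();
        for value in values {
            if let Some(list) = self.postings.get(value) {
                files.extend(list.iter().cloned());
            }
        }
        files
    }
}

/// Keep only the candidate files the index says can match the runtime values.
/// Assumption: the index covers every candidate file, so a file absent from
/// the looked-up posting lists cannot contain a matching row.
pub fn prune_files(
    candidates: &[&str],
    index: &InvertedIndex,
    runtime_values: &[&str],
) -> Vec<String> {
    let keep = index.files_for_values(runtime_values.iter().copied());
    candidates
        .iter()
        .copied()
        .filter(|file| keep.contains(*file))
        .map(str::to_string)
        .collect()
}
```

In this sketch the files dropped by `prune_files` are never opened. The open question in the issue is where such a hook can run once the filter values only become known at execution time.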
