progval opened a new pull request, #12548: URL: https://github.com/apache/datafusion/pull/12548
Inspired by datafusion-examples/examples/advanced_parquet_index.rs

## Which issue does this PR close?

This was an attempt to solve #12547, but it did not achieve that, and I am not sure it is the right approach.

## Rationale for this change

On every query on Parquet tables, DataFusion re-opens every file and parses its metadata. This takes a significant amount of time for short queries (in my use case, there is usually a single hit in the Page Index). My goal was to make these queries near-instant.

Unfortunately, I realized after writing this code that the Page Index still needs to be parsed every time, because file metadata is lost through the `listing` layer (as mentioned in #9964). So this PR only spares some (negligible?) time spent parsing metadata. I'm not sure it's worth the extra complexity, especially in `ParquetFormat`.

What do you think?

## What changes are included in this PR?

* Made `ParquetFormat` carry state (it probably deserves a renaming then...)
* Added `CachedParquetFileReaderFactory` as an alternative to `DefaultParquetFileReaderFactory`, and made it usable through a config option

## Are these changes tested?

no

## Are there any user-facing changes?

Added `datafusion.execution.parquet.cache_metadata`
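For readers unfamiliar with the idea, here is a minimal sketch of the caching pattern described in the rationale: parse a file's metadata once, then hand out shared references on later queries. It uses only the Rust standard library. `FileMetadata`, `MetadataCache`, and `get_or_load` are hypothetical names made up for illustration; this is not the code of `CachedParquetFileReaderFactory`, and it does not use any DataFusion or parquet API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug)]
struct FileMetadata {
    num_rows: u64,
}

/// Hypothetical cache keyed by file path (an assumption for this sketch,
/// not the type added in this PR).
#[derive(Default)]
struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    /// Return the cached metadata for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// serializes concurrent loads.
    fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}

fn main() {
    let cache = MetadataCache::default();
    let mut parses = 0;
    // Simulate three queries against the same file.
    for _ in 0..3 {
        let meta = cache.get_or_load("data/part-0.parquet", || {
            parses += 1; // counts how often the "expensive" parse runs
            FileMetadata { num_rows: 1_000 }
        });
        assert_eq!(meta.num_rows, 1_000);
    }
    // Only the first query paid the parsing cost.
    assert_eq!(parses, 1);
}
```

A cache like this never invalidates entries, so a real implementation would presumably also need to account for files that change between queries (for example by keying on size or modification time as well as path).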
