progval opened a new pull request, #12548:
URL: https://github.com/apache/datafusion/pull/12548

   Inspired by datafusion-examples/examples/advanced_parquet_index.rs
   
   ## Which issue does this PR close?
   
   This was an attempt to solve #12547, but it does not achieve that, and I am 
not sure it is the right approach.
   
   ## Rationale for this change
   
   On every query on Parquet tables, DataFusion re-opens every file and parses 
its metadata. This takes significant time for short queries (in my use case, 
there is usually a single hit in the Page Index).
   
   My goal was to make these queries near-instant. Unfortunately, I realized 
after writing this code that the Page Index still needs to be parsed every 
time, because file metadata is lost through the `listing` layer (as mentioned 
in #9964).
   
   So this does spare some (negligible?) time parsing metadata. I'm not sure 
it's worth the extra complexity, especially in `ParquetFormat`. What do you 
think?
   
   ## What changes are included in this PR?
   
   * Made `ParquetFormat` carry state (it probably deserves a renaming then...)
   * Added `CachedParquetFileReaderFactory` as an alternative to 
`DefaultParquetFileReaderFactory`, and made it usable through a config option
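
The caching idea behind the second bullet can be sketched roughly as below. This is a standalone illustration using only the standard library, not the code in this PR: `FileMetadata`, `MetadataCache`, and `get_or_parse` are hypothetical names, and a real factory would presumably store the Parquet crate's own metadata type rather than this stand-in.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug, PartialEq)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache: keeps parsed metadata per file path so that
/// repeated queries on the same file can skip re-parsing the footer.
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
    /// Counts cache misses, i.e. how often `parse` was actually invoked.
    parses: AtomicUsize,
}

impl MetadataCache {
    pub fn new() -> Self {
        MetadataCache {
            entries: Mutex::new(HashMap::new()),
            parses: AtomicUsize::new(0),
        }
    }

    /// Returns the cached metadata for `path`, calling `parse` only on a miss.
    /// The lock is held while parsing, which keeps the sketch simple but
    /// serializes concurrent misses.
    pub fn get_or_parse<F>(&self, path: &str, parse: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        self.parses.fetch_add(1, Ordering::SeqCst);
        let parsed = Arc::new(parse());
        entries.insert(path.to_string(), Arc::clone(&parsed));
        parsed
    }

    /// Number of times metadata was parsed rather than served from the cache.
    pub fn parse_count(&self) -> usize {
        self.parses.load(Ordering::SeqCst)
    }
}
```

In this sketch a second `get_or_parse` call for the same path returns the same `Arc` without running the parse closure. It does not address the Page Index problem described above, since that is lost at a different layer.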
   
   ## Are these changes tested?
   
   no
   
   ## Are there any user-facing changes?
   
   Added `datafusion.execution.parquet.cache_metadata`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

