Re: [PR] Support eager loading page index parquet metadata [datafusion]

via GitHub Tue, 21 Oct 2025 03:14:45 -0700


alamb commented on PR #18112:
URL: https://github.com/apache/datafusion/pull/18112#issuecomment-3425800587


   > This PR is still needed, because currently we split two part of loading 
metadata if we enable page index, and the second one loading page index will 
not use reminder, so even we setting prefetch_hint now, we can't reduce the 
page index request without this PR.
   
   As I understand it, your goal is to reduce the number of object store 
requests when loading parquet metadata (which is a good goal 👍 )
   
   However, I am not sure this PR achieves this goal. Instead, I think what 
this PR does is change **when** the requests are made to the object store (a 
bit earlier in the processing pipeline), but not the actual number of them.
   
   I think the best way to ensure we minimize the object store requests is:
   1. Use `prefetch_hint` to make a single request and read a portion of the 
end of the file
   2. Try to parse both the metadata and page index from the result (if the 
prefetch was big enough it will have both structures)
   
   If prefetch fetches enough bytes, this strategy will result in a single 
object store request to read all required metadata
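
   To make the strategy above concrete, here is a minimal std-only sketch. It 
is not DataFusion or parquet-rs code: `CountingStore`, `toy_file` and 
`load_footer` are hypothetical names, and the toy layout assumes the page 
index sits directly before the footer metadata with a length known to the 
caller (a real reader takes the index offsets from the decoded metadata).

```rust
use std::cell::Cell;

/// Hypothetical in-memory stand-in for an object store that counts range requests.
struct CountingStore {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingStore {
    fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    /// Every call models one object store request.
    fn get_range(&self, start: usize, end: usize) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        self.data[start..end].to_vec()
    }
}

/// Toy file: [row data][page index][metadata][metadata length: u32 LE]["PAR1"].
fn toy_file(data_len: usize, index_len: usize, meta_len: usize) -> Vec<u8> {
    let mut f = vec![b'D'; data_len];
    f.extend(std::iter::repeat(b'I').take(index_len));
    f.extend(std::iter::repeat(b'M').take(meta_len));
    f.extend_from_slice(&(meta_len as u32).to_le_bytes());
    f.extend_from_slice(b"PAR1");
    f
}

/// Fetch a suffix of `prefetch_hint` bytes, then serve the metadata and the
/// page index from that suffix when it covers them. Returns (metadata, page index).
fn load_footer(
    store: &CountingStore,
    prefetch_hint: usize,
    index_len: usize, // assumption: known up front in this toy model
) -> (Vec<u8>, Vec<u8>) {
    let file_len = store.data.len();
    // The 8-byte trailer (length + magic) is always needed.
    let suffix_len = prefetch_hint.max(8).min(file_len);
    let suffix_start = file_len - suffix_len;
    let suffix = store.get_range(suffix_start, file_len);

    let t = &suffix[suffix.len() - 8..];
    assert_eq!(&t[4..], b"PAR1");
    let meta_len = u32::from_le_bytes([t[0], t[1], t[2], t[3]]) as usize;
    let meta_start = file_len - 8 - meta_len;
    let index_start = meta_start - index_len;

    // Reuse the prefetched bytes when they cover the range; otherwise pay
    // for another request.
    let read = |start: usize, end: usize| -> Vec<u8> {
        if start >= suffix_start {
            suffix[start - suffix_start..end - suffix_start].to_vec()
        } else {
            store.get_range(start, end)
        }
    };

    let metadata = read(meta_start, file_len - 8);
    let page_index = read(index_start, meta_start);
    (metadata, page_index)
}
```

   In this toy model, a hint that covers the trailer, the metadata and the 
page index costs one request, while a hint of only 8 bytes costs three.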
   
   I realize setting prefetch today may not actually also parse the page index, 
but I think that is what we should be working towards (rather than adding 
another flag, unless there is some need I am missing)
   
   I personally suggest starting with an end-to-end test (perhaps in 
https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_config.rs)
 that illustrates what is happening:
   1. Runs a SQL query from a parquet file
   2. Uses an instrumented object store (e.g. something like 
https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/file_format/parquet.rs#L304-L303
 or 
https://github.com/apache/datafusion/blob/93f136c06dcb6d4cb362110ae5a4b2b3b8571bb7/datafusion-cli/src/object_storage/instrumented.rs#L253-L252)
 to verify what requests are made
   
   Then we can configure various prefetch settings and ensure that only the 
expected number of requests are made
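
   As a rough illustration of the instrumentation idea, here is a std-only 
sketch. `RangeStore`, `InMemory`, `Instrumented` and `read_tail` are 
hypothetical stand-ins, not the real `ObjectStore` trait or the instrumented 
stores linked above; `read_tail` only mimics a reader that needs the last 
`needed` bytes of a file (assumed to be no larger than the file).

```rust
use std::ops::Range;
use std::sync::Mutex;

/// Hypothetical minimal store interface (stand-in for a real object store).
trait RangeStore {
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;
}

/// Plain in-memory store.
struct InMemory(Vec<u8>);

impl RangeStore for InMemory {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

/// Wrapper that records every range requested from the inner store.
struct Instrumented<S> {
    inner: S,
    log: Mutex<Vec<Range<usize>>>,
}

impl<S: RangeStore> Instrumented<S> {
    fn new(inner: S) -> Self {
        Self { inner, log: Mutex::new(Vec::new()) }
    }

    /// Return the requests seen so far and reset the log.
    fn take_requests(&self) -> Vec<Range<usize>> {
        std::mem::take(&mut *self.log.lock().unwrap())
    }
}

impl<S: RangeStore> RangeStore for Instrumented<S> {
    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.log.lock().unwrap().push(range.clone());
        self.inner.get_range(range)
    }
}

/// Hypothetical code under test: one suffix request of `prefetch` bytes, plus
/// a second request when that suffix is shorter than the `needed` bytes.
fn read_tail(store: &impl RangeStore, file_len: usize, needed: usize, prefetch: usize) {
    let first = prefetch.min(file_len);
    let _ = store.get_range(file_len - first..file_len);
    if first < needed {
        let _ = store.get_range(file_len - needed..file_len - first);
    }
}
```

   A test can then run the reader under several prefetch settings and compare 
`take_requests()` against the exact ranges it expects, which checks both the 
number and the shape of the requests.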
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

