alamb commented on PR #18112: URL: https://github.com/apache/datafusion/pull/18112#issuecomment-3418442831
From what I can tell, to read ParquetMetadata with the default configuration of ListingTable, DataFusion will issue 3 object store requests:

* 8 bytes for the footer (which contains the length of the metadata)
* N bytes for the metadata (which has offsets to the page index, but not the page index structures themselves)
* M bytes for the page index structures (which are typically right before the metadata in the file, but are not required to be)

The first 8-byte request could be avoided by changing the default `prefetch_hint`, aka https://github.com/apache/datafusion/issues/18118

I think you could potentially avoid the third request if you extended the `prefetch_hint` code to use the page index if it was fetched in the initial request.

So the flow would be: DataFusion makes an initial request for the last `prefetch_hint` bytes in the file. If that happens to contain enough bytes for the metadata and page index, no more requests would be made. If additional data was needed, additional requests would be made.

The newly added push metadata decoder likely makes this easier to implement (as it will tell you what ranges are needed):
- https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html
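
To make the proposed flow concrete, here is a rough sketch of the range arithmetic in plain Rust. It does not use the DataFusion or parquet APIs: `remaining_requests` and its parameters are hypothetical names for illustration, and the page index range is passed in directly, whereas a real reader only learns it after decoding the metadata. The only format facts assumed are that the footer is the last 8 bytes of the file and that the metadata sits immediately before it.

```rust
use std::ops::Range;

/// Parquet footer: 4-byte little-endian metadata length + 4-byte magic "PAR1".
const FOOTER_SIZE: u64 = 8;

/// Hypothetical helper (not a DataFusion or parquet API).
///
/// Given an initial request for the last `prefetch_hint` bytes of a file,
/// returns the byte ranges that would still have to be fetched to obtain
/// the metadata and, if present, the page index.
///
/// Assumes a well-formed file: `file_len >= FOOTER_SIZE + metadata_len`
/// and the page index lies before the footer.
fn remaining_requests(
    file_len: u64,
    prefetch_hint: u64,
    metadata_len: u64,
    page_index: Option<Range<u64>>,
) -> Vec<Range<u64>> {
    // The initial request always has to cover at least the footer.
    let fetched_start = file_len.saturating_sub(prefetch_hint.max(FOOTER_SIZE));
    let mut missing = Vec::new();

    // The metadata ends where the footer begins.
    let metadata_end = file_len.saturating_sub(FOOTER_SIZE);
    let metadata_start = metadata_end.saturating_sub(metadata_len);
    if metadata_start < fetched_start {
        // Only the part not covered by the initial request is missing.
        missing.push(metadata_start..fetched_start.min(metadata_end));
    }

    // The page index is usually right before the metadata, but need not be.
    if let Some(pi) = page_index {
        if pi.start < fetched_start {
            missing.push(pi.start..pi.end.min(fetched_start));
        }
    }

    missing
}
```

For a made-up 1,000,000-byte file with 20,000 bytes of metadata and a page index at `950_000..979_992`, a hint of 8 bytes leaves two ranges to fetch (the three requests described above), a hint of 32 KiB leaves only part of the page index, and a hint of 64 KiB leaves nothing.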
