progval commented on PR #12593: URL: https://github.com/apache/datafusion/pull/12593#issuecomment-2376956688
> I wonder if we could change the "automatically load page index if needed" to "error if page index is needed but it is not loaded" 🤔 That might be a less surprising behavior I agree, and it would work for my use case because I always want the page index. I did it like this: <details><summary>Click to unfold</summary> ```diff fn get_metadata(&mut self) -> BoxFuture<'_, parquet::errors::Result<Arc<ParquetMetaData>>> { - self.inner.get_metadata() + Box::pin(self.get_metadata_async()) } fn get_byte_ranges( @@ -117,6 +118,55 @@ impl AsyncFileReader for CachingParquetFileReader { } } +impl CachingParquetFileReader { + /// Implementation of [`AsyncFileReader::get_metadata`] using new-style async, + /// so it can pass the borrow checker + async fn get_metadata_async(&mut self) -> parquet::errors::Result<Arc<ParquetMetaData>> { + match &self.metadata { + Some(metadata) => Ok(Arc::clone(metadata)), + None => match self.inner.get_metadata().await { + Ok(metadata) => { + // This function is called by `ArrowReaderMetadata::load_async`. + // Then, `load_async` may enrich the `ParquetMetaData` we return with + // the page index, using `MetadataLoader`; and this enriched + // `ParquetMetaData` reader would not be cached. + // + // Datafusion does not (currently) support caching the enriched + // `ParquetMetaData`, so we unconditionally enrich it here with + // the page index, so we can cache it. + // + // See: + // * discussion on https://github.com/apache/datafusion/pull/12593 + // * https://github.com/apache/arrow-rs/blob/62825b27e98e6719cb66258535c75c7490ddba44/parquet/src/arrow/async_reader/mod.rs#L212-L228 + let metadata = Arc::try_unwrap(metadata).unwrap_or_else(|e| e.as_ref().clone()); + let mut loader = MetadataLoader::new( + CachingParquetFileReaderMetadataFetch(self), + metadata, + ); + loader.load_page_index(true, true).await?; + let metadata = Arc::new(loader.finish()); + self.metadata = Some(Arc::clone(&metadata)); + Ok(metadata) + } + Err(e) => Err(e), + }, + } + } +} + +struct CachingParquetFileReaderMetadataFetch<'a>(&'a mut CachingParquetFileReader); + +impl<'a> MetadataFetch for CachingParquetFileReaderMetadataFetch<'a> { + fn fetch( + &mut self, + range: Range<usize>, + ) -> BoxFuture<'_, parquet::errors::Result<bytes::Bytes>> { + println!("fetch"); + self.0.fetch(range) + } +} ``` </summary> (For some reason it's a bit slower even though it doesn't call `fetch` more, I'll investigate in a few days) But for a generic solution, the `FileOpener` would need to tell `AsyncFileReader::get_metadata` (through `ArrowReaderMetadata`) whether the page index is needed, so this needs an API change (or at least addition) to Arrow. But since the `parquet` crate is being refactored, this may be the right time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org