Re: [PR] parquet: Add support for user-provided metadata loaders [datafusion]

via GitHub Thu, 26 Sep 2024 06:21:46 -0700


progval commented on PR #12593:
URL: https://github.com/apache/datafusion/pull/12593#issuecomment-2376956688


   > I wonder if we could change the "automatically load page index if needed" 
to "error if page index is needed but it is not loaded" 🤔 That might be a less 
surprising behavior
   
   I agree, and it would work for my use case because I always want the page 
index. I did it like this:
   
   <details><summary>Click to unfold</summary>
   
   ```diff
        fn get_metadata(&mut self) -> BoxFuture<'_, 
parquet::errors::Result<Arc<ParquetMetaData>>> {
   -        self.inner.get_metadata()
   +        Box::pin(self.get_metadata_async())
        }
    
        fn get_byte_ranges(
   @@ -117,6 +118,55 @@ impl AsyncFileReader for CachingParquetFileReader {
        }
    }
    
   +impl CachingParquetFileReader {
   +    /// Implementation of [`AsyncFileReader::get_metadata`] using new-style 
async,
   +    /// so it can pass the borrow checker
   +    async fn get_metadata_async(&mut self) -> 
parquet::errors::Result<Arc<ParquetMetaData>> {
   +        match &self.metadata {
   +            Some(metadata) => Ok(Arc::clone(metadata)),
   +            None => match self.inner.get_metadata().await {
   +                Ok(metadata) => {
   +                    // This function is called by 
`ArrowReaderMetadata::load_async`.
   +                    // Then, `load_async` may enrich the `ParquetMetaData` 
we return with
   +                    // the page index, using `MetadataLoader`; and this 
enriched
   +                    // `ParquetMetaData` reader would not be cached.
   +                    //
   +                    // Datafusion does not (currently) support caching the 
enriched
   +                    // `ParquetMetaData`, so we unconditionally enrich it 
here with
   +                    // the page index, so we can cache it.
   +                    //
   +                    // See:
   +                    // * discussion on 
https://github.com/apache/datafusion/pull/12593
   +                    // * 
https://github.com/apache/arrow-rs/blob/62825b27e98e6719cb66258535c75c7490ddba44/parquet/src/arrow/async_reader/mod.rs#L212-L228
   +                    let metadata = 
Arc::try_unwrap(metadata).unwrap_or_else(|e| e.as_ref().clone());
   +                    let mut loader = MetadataLoader::new(
   +                        CachingParquetFileReaderMetadataFetch(self),
   +                        metadata,
   +                    );
   +                    loader.load_page_index(true, true).await?;
   +                    let metadata = Arc::new(loader.finish());
   +                    self.metadata = Some(Arc::clone(&metadata));
   +                    Ok(metadata)
   +                }
   +                Err(e) => Err(e),
   +            },
   +        }
   +    }
   +}
   +
   +struct CachingParquetFileReaderMetadataFetch<'a>(&'a mut 
CachingParquetFileReader);
   +
   +impl<'a> MetadataFetch for CachingParquetFileReaderMetadataFetch<'a> {
   +    fn fetch(
   +        &mut self,
   +        range: Range<usize>,
   +    ) -> BoxFuture<'_, parquet::errors::Result<bytes::Bytes>> {
   +        println!("fetch");
   +        self.0.fetch(range)
   +    }
   +}
   ```
   </summary>
   
   (For some reason it's a bit slower even though it doesn't call `fetch` more, 
I'll investigate in a few days)
   
   But for a generic solution, the `FileOpener` would need to tell 
`AsyncFileReader::get_metadata` (through `ArrowReaderMetadata`) whether the 
page index is needed, so this needs an API change (or at least addition) to 
Arrow. But since the `parquet` crate is being refactored, this may be the right 
time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] parquet: Add support for user-provided metadata loaders [datafusion]

Reply via email to