Re: [PR] Deprecate `MetadataLoader` [arrow-rs]

via GitHub Sat, 28 Sep 2024 02:53:01 -0700


alamb commented on code in PR #6474:
URL: https://github.com/apache/arrow-rs/pull/6474#discussion_r1779451844



##########
parquet/src/arrow/async_reader/mod.rs:
##########
@@ -212,16 +212,18 @@ impl ArrowReaderMetadata {
         input: &mut T,
         options: ArrowReaderOptions,
     ) -> Result<Self> {
+        // TODO: this is all rather awkward. It would be nice if 
AsyncFileReader::get_metadata

Review Comment:
   Not only is this akward, it is also a common source of confusion / bugs 
(namely that when someone supplies the ParquetMetaData to the arrow reader 
options to avoid a second object store request, if often turns out the second 
fetch happens anyways to read the page index (thus obviating the attempt at 
optimization)
   
   To avoid this they need to ensure when they read the metadata in the first 
place, they also read the page index. 
   
   This is (in a roundabout way) what is happening to @progval in 
https://github.com/apache/datafusion/pull/12593
   
   I will try and file a ticket explaining the issue



##########
parquet/src/arrow/async_reader/store.rs:
##########
@@ -124,15 +124,13 @@ impl AsyncFileReader for ParquetObjectReader {
 
     fn get_metadata(&mut self) -> BoxFuture<'_, Result<Arc<ParquetMetaData>>> {
         Box::pin(async move {
-            let preload_column_index = self.preload_column_index;
-            let preload_offset_index = self.preload_offset_index;
-            let file_size = self.meta.size;
-            let prefetch = self.metadata_size_hint;
-            let mut loader = MetadataLoader::load(self, file_size, 
prefetch).await?;
-            loader
-                .load_page_index(preload_column_index, preload_offset_index)
+            let metadata = ParquetMetaDataReader::new()

Review Comment:
   
![so-beautiful](https://github.com/user-attachments/assets/41b835c4-1972-4334-88ba-39069f8885be)
   



##########
parquet/src/arrow/async_reader/metadata.rs:
##########
@@ -235,11 +249,15 @@ where
     F: FnMut(Range<usize>) -> Fut + Send,
     Fut: Future<Output = Result<Bytes>> + Send,
 {
-    let fetch = MetadataFetchFn(fetch);
-    let loader = MetadataLoader::load(fetch, file_size, prefetch).await?;
-    Ok(loader.finish())
+    let mut fetch = MetadataFetchFn(fetch);
+    ParquetMetaDataReader::new()
+        .with_prefetch_hint(prefetch)
+        .load_and_finish(&mut fetch, file_size)
+        .await
 }
 
+// these tests are all replicated in parquet::file::metadata::reader

Review Comment:
   👍 



##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
     /// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the 
number of fetches
     /// performed by this function.
     #[cfg(feature = "async")]
-    pub async fn load_and_finish<F: MetadataFetch>(
+    pub async fn load_and_finish<F>(

Review Comment:
   since this API is new and has not yet been released, this is not a breaking 
API change



##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
     /// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the 
number of fetches
     /// performed by this function.
     #[cfg(feature = "async")]
-    pub async fn load_and_finish<F: MetadataFetch>(
+    pub async fn load_and_finish<F>(
         mut self,
-        fetch: F,
+        fetch: &mut F,
         file_size: usize,
-    ) -> Result<ParquetMetaData> {
+    ) -> Result<ParquetMetaData>
+    where
+        for<'a> &'a mut F: MetadataFetch,

Review Comment:
   > cc @alamb. I did not like that the load... functions in 
ParquetMetaDataReader took ownership of fetch.
   
   What don't you like about it? Is the idea that `MetadataFetch` actually has 
mutable state so that calling `fetch` would change its state?
   
   I think in general trying to get async functions to be happy with references 
is often much harder than with owned objects. 
   
   So I guess I think we should keep the API as owned to make working with 
these APIs simpler, unless there is some compelling reason for the change 
(better performance, etc)
   



##########
parquet/src/file/metadata/reader.rs:
##########
@@ -299,11 +299,14 @@ impl ParquetMetaDataReader {
     /// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the 
number of fetches
     /// performed by this function.
     #[cfg(feature = "async")]
-    pub async fn load_and_finish<F: MetadataFetch>(
+    pub async fn load_and_finish<F>(

Review Comment:
   (Specifically `ParquetMetaDataReader` has not been released yet)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Deprecate `MetadataLoader` [arrow-rs]

Reply via email to