alamb commented on code in PR #17127:
URL: https://github.com/apache/datafusion/pull/17127#discussion_r2267708500


##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -1038,98 +1015,32 @@ impl MetadataFetch for ObjectStoreFetch<'_> {
 /// through [`ParquetFileReaderFactory`].
 ///
 /// [`ParquetFileReaderFactory`]: crate::ParquetFileReaderFactory
-pub async fn fetch_parquet_metadata<F: MetadataFetch>(
-    fetch: F,
+#[deprecated(

Review Comment:
   I left all the existing public APIs and deprecated them, and updated them to 
call the new DFParquetMetadata structure



##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -1935,40 +1688,9 @@ async fn output_single_parquet_file_parallelized(
     Ok(file_metadata)
 }
 
-/// Min/max aggregation can take Dictionary encode input but always produces 
unpacked

Review Comment:
   I am quite please that most of the statistics handling is now consolidated 
into its own module



##########
datafusion/datasource-parquet/src/reader.rs:
##########
@@ -256,44 +258,19 @@ impl AsyncFileReader for CachedParquetFileReader {
             #[cfg(not(feature = "parquet_encryption"))]
             let file_decryption_properties = None;
 
-            fetch_parquet_metadata(
-                &mut self.inner,
-                &file_meta.object_meta,
-                None,
-                file_decryption_properties,
-                Some(metadata_cache),
-            )
-            .await
-            .map_err(|e| {
-                parquet::errors::ParquetError::General(format!(
-                    "Failed to fetch metadata for file {}: {e}",
-                    file_meta.object_meta.location,
-                ))
-            })
+            // TODO should there be metadata prefetch hint here?

Review Comment:
   The metadata prefetch hint isn't passed here (it isn't on `main` either) but 
this refactor leads me to believe it might be helpful to do so 🤔 



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -195,31 +194,24 @@ mod tests {
         let format = 
ParquetFormat::default().with_force_view_types(force_views);
         let schema = format.infer_schema(&ctx, &store, &meta).await?;
 
-        let stats = fetch_statistics(
-            store.as_ref(),
-            schema.clone(),
-            &meta[0],
-            None,
-            None,
-            Some(ctx.runtime_env().cache_manager.get_file_metadata_cache()),
-        )
-        .await?;
+        let file_metadata_cache =

Review Comment:
   this shows the key API difference -- instead of calling a bunch of free 
functions, you now construct a `DFParquetMetadata` and call methods on that 
struct instead



##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -392,51 +384,27 @@ mod tests {
 
         // Use a size hint larger than the parquet footer but smaller than the 
actual metadata, requiring a second fetch
         // for the remaining metadata
-        fetch_parquet_metadata(
-            ObjectStoreFetch::new(store.as_ref() as &dyn ObjectStore, 
&meta[0]),
-            &meta[0],
-            Some(9),
-            None,
-            None,
-        )
-        .await
-        .expect("error reading metadata with hint");
+        let file_metadata_cache =
+            ctx.runtime_env().cache_manager.get_file_metadata_cache();
+        let df_meta = DFParquetMetadata::new(store.as_ref(), &meta[0])
+            .with_metadata_size_hint(Some(9));
+        df_meta.fetch_metadata().await?;
         assert_eq!(store.request_count(), 2);
 
+        let df_meta =
+            
df_meta.with_file_metadata_cache(Some(Arc::clone(&file_metadata_cache)));
+
         // Increases by 3 because cache has no entries yet
-        fetch_parquet_metadata(

Review Comment:
   I think the new struct makes it much clearer what is being tested vs what is 
test setup functionality and I find the updated tests to be much easier to read



##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -306,30 +301,6 @@ fn clear_metadata(
     })
 }
 
-async fn fetch_schema_with_location(

Review Comment:
   Most of this PR is moving code in this module into `metadata.rs` 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to