jdockerty commented on code in PR #2176:
URL: https://github.com/apache/iceberg-rust/pull/2176#discussion_r2890564559


##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -513,14 +522,51 @@ impl ArrowReader {
         .with_preload_page_index(should_load_page_index);
 
         if let Some(hint) = metadata_size_hint {
-            parquet_file_reader = 
parquet_file_reader.with_metadata_size_hint(hint);
+            reader = reader.with_metadata_size_hint(hint);
         }
 
-        // Create the record batch stream builder, which wraps the parquet 
file reader
-        let options = arrow_reader_options.unwrap_or_default();
-        let record_batch_stream_builder =
-            
ParquetRecordBatchStreamBuilder::new_with_options(parquet_file_reader, 
options).await?;
-        Ok(record_batch_stream_builder)
+        let arrow_metadata = ArrowReaderMetadata::load_async(&mut reader, 
Default::default())
+            .await
+            .map_err(|e| {
+                Error::new(ErrorKind::Unexpected, "Failed to load Parquet 
metadata").with_source(e)
+            })?;
+
+        Ok(Arc::clone(arrow_metadata.metadata()))
+    }
+
+    pub(crate) async fn create_parquet_record_batch_stream_builder(
+        data_file_path: &str,
+        file_io: FileIO,
+        should_load_page_index: bool,
+        arrow_reader_options: ArrowReaderOptions,
+        file_size_in_bytes: u64,
+        parquet_metadata: Arc<ParquetMetaData>,
+    ) -> Result<ParquetRecordBatchStreamBuilder<ArrowFileReader>> {
+        let arrow_reader_metadata =
+            ArrowReaderMetadata::try_new(parquet_metadata, 
arrow_reader_options).map_err(|e| {
+                Error::new(
+                    ErrorKind::Unexpected,
+                    "Failed to create ArrowReaderMetadata from 
ParquetMetaData",
+                )
+                .with_source(e)
+            })?;
+
+        let parquet_file = file_io.new_input(data_file_path)?;
+        let parquet_reader = parquet_file.reader().await?;
+        let parquet_file_reader = ArrowFileReader::new(
+            FileMetadata {
+                size: file_size_in_bytes,
+            },
+            parquet_reader,
+        )
+        .with_preload_column_index(true)
+        .with_preload_offset_index(true)

Review Comment:
   I'm not super familiar with the depth of details in Parquet, but are these 
universal enough to be a sane default?
   
   Can you explain the decision to hard-code these to `true`? Whereas 
preloading the `page_index` is decided by the caller via 
`should_load_page_index`.



##########
crates/iceberg/src/arrow/reader.rs:
##########
@@ -490,19 +501,17 @@ impl ArrowReader {
         Ok(Box::pin(record_batch_stream) as ArrowRecordBatchStream)
     }
 
-    pub(crate) async fn create_parquet_record_batch_stream_builder(
+    /// Loads Parquet metadata from storage, handling file size optimization.
+    pub(crate) async fn load_parquet_metadata(
         data_file_path: &str,
-        file_io: FileIO,
+        file_io: &FileIO,
         should_load_page_index: bool,
-        arrow_reader_options: Option<ArrowReaderOptions>,

Review Comment:
   How come the options are being removed here?
   
It seems that this forces the `Default::default()` usage below at all times 
instead of allowing reader options to be provided. Is that intended?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to