[I] Optimize parquet_files_to_data_files by reusing schema index [iceberg-rust]

via GitHub Thu, 15 Jan 2026 12:37:50 -0800


pratik0316 opened a new issue, #2032:
URL: https://github.com/apache/iceberg-rust/issues/2032


   In the current implementation of [parquet_files_to_data_files](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L312) in [parquet_writer.rs](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs), we iterate over a list of files to convert them into Iceberg [DataFile](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/base_writer/data_file_writer.rs#L66-L69) structs.
   
   However, for every single file in the loop, we invoke [parquet_to_data_file_builder](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L350), which internally rebuilds the schema index from scratch:
   
   ```rust
   // crates/iceberg/src/writer/file_writer/parquet_writer.rs

   pub(crate) fn parquet_to_data_file_builder(...) -> Result<DataFileBuilder> {
       // This runs for every file!
       let index_by_parquet_path = {
           let mut visitor = IndexByParquetPathName::new();
           visit_schema(&schema, &mut visitor)?;
           visitor
       };

       // ...
   }
   ```
   
   **Problem**: When importing a large number of files (e.g., thousands of files in a bulk import), we traverse the entire schema and allocate a new `name_to_id` HashMap once per file, even though the schema is constant for the entire operation.
   
   **Proposed Solution**:
   
   - Extract the `IndexByParquetPathName` creation logic out of `parquet_to_data_file_builder` ([parquet_writer.rs#L357](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L357)).
   - Compute this index once at the beginning of `parquet_files_to_data_files` (outside the loop).
   - Update `parquet_to_data_file_builder` to accept the index as a reference argument.
   - Reuse the same index for every file iteration.
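
The steps above could look roughly like the following. This is only a minimal, self-contained sketch of the hoisting pattern, not the actual iceberg-rust code: `Schema`, `PathIndex`, `ParquetFile`, `DataFile`, `to_data_file`, and `files_to_data_files` are hypothetical stand-ins for the real schema, `IndexByParquetPathName`, the Parquet metadata, the `DataFile` builder output, `parquet_to_data_file_builder`, and `parquet_files_to_data_files` respectively.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Iceberg schema: (field id, Parquet column path) pairs.
struct Schema {
    fields: Vec<(i32, String)>,
}

/// Hypothetical stand-in for the index that `IndexByParquetPathName` produces.
struct PathIndex {
    name_to_id: HashMap<String, i32>,
}

impl PathIndex {
    /// Walks the schema once and records the field id for each column path.
    fn build(schema: &Schema) -> Self {
        let name_to_id = schema
            .fields
            .iter()
            .map(|(id, path)| (path.clone(), *id))
            .collect();
        PathIndex { name_to_id }
    }
}

/// Hypothetical per-file input: the file path and the column paths it contains.
struct ParquetFile {
    path: String,
    columns: Vec<String>,
}

/// Hypothetical simplified output of the conversion.
#[derive(Debug, PartialEq)]
struct DataFile {
    path: String,
    field_ids: Vec<i32>,
}

/// Per-file conversion: borrows a prebuilt index instead of rebuilding it.
/// Columns that are not in the index are skipped in this sketch.
fn to_data_file(file: &ParquetFile, index: &PathIndex) -> DataFile {
    let field_ids = file
        .columns
        .iter()
        .filter_map(|column| index.name_to_id.get(column).copied())
        .collect();
    DataFile {
        path: file.path.clone(),
        field_ids,
    }
}

/// Batch conversion: the index is built once, before the loop, and shared by reference.
fn files_to_data_files(schema: &Schema, files: &[ParquetFile]) -> Vec<DataFile> {
    let index = PathIndex::build(schema);
    files.iter().map(|file| to_data_file(file, &index)).collect()
}
```

With this shape, the schema traversal and the HashMap allocation happen once per call to the batch function rather than once per file, while the per-file function only performs lookups against the borrowed index.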
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

