pratik0316 opened a new issue, #2032: URL: https://github.com/apache/iceberg-rust/issues/2032
In the current implementation of [parquet_files_to_data_files](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L312) in [parquet_writer.rs](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs), we iterate over a list of files to convert them into Iceberg [DataFile](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/base_writer/data_file_writer.rs#L66-L69) structs. However, for every single file in the loop, we invoke [parquet_to_data_file_builder](https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L350), which internally rebuilds the schema index from scratch:

```
// crates/iceberg/src/writer/file_writer/parquet_writer.rs
pub(crate) fn parquet_to_data_file_builder(...) -> Result<DataFileBuilder> {
    // This runs for every file!
    let index_by_parquet_path = {
        let mut visitor = IndexByParquetPathName::new();
        visit_schema(&schema, &mut visitor)?;
        visitor
    };
    // ...
}
```

**Problem**: When importing a large number of files (e.g., thousands of files in a bulk import), we are traversing the entire schema and allocating a new name_to_id HashMap thousands of times, even though the schema is constant for the entire operation.

**Proposed Solution**:
- Extract the IndexByParquetPathName creation logic out of parquet_to_data_file_builder https://github.com/apache/iceberg-rust/blob/b05a675db44645becc60422b596f16cca8816a89/crates/iceberg/src/writer/file_writer/parquet_writer.rs#L357
- Compute this index once at the beginning of parquet_files_to_data_files (outside the loop).
- Update parquet_to_data_file_builder to accept the index as a reference argument.
- Reuse the same index for every file iteration.
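
To illustrate the shape of the proposed change, here is a minimal, self-contained sketch. It does not use the real iceberg-rust types: `Schema`, `PathIndex`, `file_to_field_ids`, and `files_to_field_ids` are hypothetical stand-ins for the schema, `IndexByParquetPathName`, `parquet_to_data_file_builder`, and `parquet_files_to_data_files` respectively, and the real signatures will differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Iceberg schema:
/// a flat list of (field id, Parquet column path) pairs.
struct Schema {
    fields: Vec<(i32, String)>,
}

/// Hypothetical stand-in for `IndexByParquetPathName`:
/// maps a Parquet column path to its field id.
struct PathIndex {
    name_to_id: HashMap<String, i32>,
}

impl PathIndex {
    /// Walks the schema once and allocates the lookup table.
    fn build(schema: &Schema) -> Self {
        let name_to_id = schema
            .fields
            .iter()
            .map(|(id, path)| (path.clone(), *id))
            .collect();
        PathIndex { name_to_id }
    }
}

/// Hypothetical per-file conversion. It borrows a prebuilt index
/// instead of rebuilding it, and returns the field ids of the
/// file's columns that the schema knows about.
fn file_to_field_ids(columns: &[&str], index: &PathIndex) -> Vec<i32> {
    columns
        .iter()
        .filter_map(|c| index.name_to_id.get(*c).copied())
        .collect()
}

/// Hypothetical batch conversion. The index is built once,
/// outside the loop, and shared by reference across all files.
fn files_to_field_ids(schema: &Schema, files: &[Vec<&str>]) -> Vec<Vec<i32>> {
    let index = PathIndex::build(schema);
    files
        .iter()
        .map(|cols| file_to_field_ids(cols, &index))
        .collect()
}
```

With this shape, the schema traversal and the HashMap allocation happen once per batch rather than once per file, and the per-file function only performs lookups against the shared reference.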
