shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3486646010
Thanks @huaxingao for the review and feedback! I've addressed both of your
points.
1. IO integration:
Good catch! I've updated the implementation to use table.io() instead of
hardcoding HadoopFileIO. The new approach:
- Executors still use the Hadoop Configuration for the actual Parquet file
merging, since ParquetFileMerger internally uses Parquet's appendFile, which
requires Hadoop APIs
- Driver now uses table.io().newInputFile() to read metrics, which properly
respects the catalog's configured FileIO implementation (see the sketch after
this list)
- This ensures compatibility with different storage systems (S3, GCS,
Azure, custom FileIO implementations)
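
Roughly, the driver-side read now looks like this (a minimal sketch;
readMergedFileMetrics and mergedPath are illustrative names, and
ParquetUtil.fileMetrics is the footer-based metrics helper I'm relying on):

```java
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.ParquetUtil;

// Driver-side helper (sketch): read metrics for a merged file through table.io().
static Metrics readMergedFileMetrics(Table table, String mergedPath) {
  // Resolve the file via the catalog's configured FileIO (S3FileIO, GCSFileIO, ...),
  // rather than a hardcoded HadoopFileIO.
  InputFile mergedFile = table.io().newInputFile(mergedPath);
  // Extract column-level metrics from the Parquet footer.
  return ParquetUtil.fileMetrics(mergedFile, MetricsConfig.forTable(table));
}
```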
2. Executor/driver split:
I've refactored to follow the recommended pattern:
- Executors: only perform the file merge operation and return lightweight
metadata (file path, size) via a MergeResult object
- Driver: receives the metadata, reads metrics using table.io(), and
constructs the full DataFile objects (see the sketch after this list)
- This minimizes serialization overhead and keeps heavyweight objects on
the driver side
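
To make the split concrete, here is a rough sketch of the two pieces
(the MergeResult fields and the toDataFile helper are illustrative shapes,
not the final API):

```java
import java.io.Serializable;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.Table;

// Executor -> driver payload: lightweight and Serializable, so only the path
// and size cross the wire, not file handles or metrics readers.
public class MergeResult implements Serializable {
  private final String path;
  private final long fileSizeInBytes;

  public MergeResult(String path, long fileSizeInBytes) {
    this.path = path;
    this.fileSizeInBytes = fileSizeInBytes;
  }

  public String path() {
    return path;
  }

  public long fileSizeInBytes() {
    return fileSizeInBytes;
  }
}

// Driver side (sketch): combine the executor's metadata with the metrics read
// via table.io() to construct the full DataFile. Partitioned tables would also
// need withPartition(...).
static DataFile toDataFile(Table table, MergeResult result, Metrics metrics) {
  return DataFiles.builder(table.spec())
      .withPath(result.path())
      .withFileSizeInBytes(result.fileSizeInBytes())
      .withFormat(FileFormat.PARQUET)
      .withMetrics(metrics)
      .build();
}
```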
Additionally, I've addressed the architectural feedback about planning vs.
execution:
- Removed the groupFilesBySize() logic from the runner, since the planner
already creates appropriately sized groups
- Runner now merges the entire file group into a single output file
without further splitting (see the sketch after this list)
- This creates a cleaner separation: planning happens in the planner and
execution happens in the runner
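
The runner change amounts to something like this (a sketch; mergeParquetFiles
stands in for the PR's ParquetFileMerger entry point, whose exact signature
may differ):

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;

// Executor side (sketch): merge every file in the planned group into one output.
static MergeResult mergeGroup(List<FileScanTask> group, String outputPath, Configuration conf) {
  List<String> inputPaths = group.stream()
      .map(task -> task.file().path().toString())
      .collect(Collectors.toList());
  // No groupFilesBySize() here: the planner already sized the group, so the
  // runner simply appends all inputs into a single output file.
  return mergeParquetFiles(inputPaths, outputPath, conf);
}
```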
Let me know if there are any other concerns or improvements you'd like to
see!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]