shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3486646010
Thanks @huaxingao for the review and feedback! I've addressed both of your
points.
1. IO integration:
Good catch! I've updated the implementation to use table.io() instead of
hardcoding HadoopFileIO. The new approach:
- Executors still use the Hadoop Configuration for the actual Parquet file
merging, since ParquetFileMerger internally uses Parquet's appendFile, which
requires Hadoop APIs
- Driver now uses table.io().newInputFile() to read metrics, which properly
respects the catalog's configured FileIO implementation (see the sketch after
this list)
- This ensures compatibility with different storage systems (S3, GCS,
Azure, custom FileIO implementations)
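
Roughly, the driver-side read now looks like this (a minimal sketch;
readMergedFileMetrics and mergedPath are illustrative names, and
ParquetUtil.fileMetrics is the footer-based metrics helper I'm relying on):

```java
import org.apache.iceberg.Metrics;
import org.apache.iceberg.MetricsConfig;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.parquet.ParquetUtil;

// Driver-side helper (sketch): read metrics for a merged file through table.io().
static Metrics readMergedFileMetrics(Table table, String mergedPath) {
  // Resolve the file via the catalog's configured FileIO (S3FileIO, GCSFileIO, ...),
  // rather than a hardcoded HadoopFileIO.
  InputFile mergedFile = table.io().newInputFile(mergedPath);
  // Extract column-level metrics from the Parquet footer.
  return ParquetUtil.fileMetrics(mergedFile, MetricsConfig.forTable(table));
}
```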
2. Executor/driver split:
I've refactored to follow the recommended pattern:
- Executors: only perform the file merge operation and return lightweight
metadata (file path, size) via a MergeResult object
- Driver: receives the metadata, reads metrics using table.io(), and
constructs the full DataFile objects (see the sketch after this list)
- This minimizes serialization overhead and keeps heavyweight objects on
the driver side
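
To make the split concrete, here is a rough sketch of the two pieces
(the MergeResult fields and the toDataFile helper are illustrative shapes,
not the final API):

```java
import java.io.Serializable;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.DataFiles;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Metrics;
import org.apache.iceberg.Table;

// Executor -> driver payload: lightweight and Serializable, so only the path
// and size cross the wire, not file handles or metrics readers.
public class MergeResult implements Serializable {
  private final String path;
  private final long fileSizeInBytes;

  public MergeResult(String path, long fileSizeInBytes) {
    this.path = path;
    this.fileSizeInBytes = fileSizeInBytes;
  }

  public String path() {
    return path;
  }

  public long fileSizeInBytes() {
    return fileSizeInBytes;
  }
}

// Driver side (sketch): combine the executor's metadata with the metrics read
// via table.io() to construct the full DataFile. Partitioned tables would also
// need withPartition(...).
static DataFile toDataFile(Table table, MergeResult result, Metrics metrics) {
  return DataFiles.builder(table.spec())
      .withPath(result.path())
      .withFileSizeInBytes(result.fileSizeInBytes())
      .withFormat(FileFormat.PARQUET)
      .withMetrics(metrics)
      .build();
}
```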
Additionally, I've addressed the architectural feedback about planning vs.
execution:
- Removed the groupFilesBySize() logic from the runner, since the planner
already creates appropriately sized groups
- Runner now merges the entire file group into a single output file
without further splitting (see the sketch after this list)
- This creates a cleaner separation: planning happens in the planner and
execution happens in the runner
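
The runner change amounts to something like this (a sketch; mergeParquetFiles
stands in for the PR's ParquetFileMerger entry point, whose exact signature
may differ):

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.FileScanTask;

// Executor side (sketch): merge every file in the planned group into one output.
static MergeResult mergeGroup(List<FileScanTask> group, String outputPath, Configuration conf) {
  List<String> inputPaths = group.stream()
      .map(task -> task.file().path().toString())
      .collect(Collectors.toList());
  // No groupFilesBySize() here: the planner already sized the group, so the
  // runner simply appends all inputs into a single output file.
  return mergeParquetFiles(inputPaths, outputPath, conf);
}
```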
Let me know if there are any other concerns or improvements you'd like to
see!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]