shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3539161303
Summary of Changes Made to Address PR Comments in Round 3:
1. Removed "zero-copy" terminology
- Files: RewriteDataFiles.java, ParquetFileMerger.java
- Change: Replaced "zero-copy" with more accurate "directly copying row
groups without deserialization and re-serialization"
2. Added missing restrictions to javadoc
- Files: RewriteDataFiles.java, ParquetFileMerger.java
- Added:
- Files must not have associated delete files or deletion vectors
- Table must not have a sort order (including z-ordered tables)
- All files must have compatible schemas (in ParquetFileMerger)
3. Replaced deprecated ParquetFileWriter constructor
- File: ParquetFileMerger.java
- Change: Switched from deprecated 5-parameter constructor to
non-deprecated 8-parameter constructor with proper defaults
4. Made configuration values user-configurable
- Files: SparkParquetFileMergeRunner.java, ParquetFileMerger.java
- Added:
- rowGroupSize: Read from table property
write.parquet.row-group-size-bytes (default 128MB)
- columnIndexTruncateLength: Read from Hadoop config
parquet.columnindex.truncate.length (default 64)
5. Implemented encryption check
- File: SparkParquetFileMergeRunner.java
- Change: Added a catch for ParquetCryptoRuntimeException to detect
encrypted files
6. Removed custom grouping logic (planner concern)
- File: SparkParquetFileMergeRunner.java
- Change: Replaced groupFilesBySize() with distributeFilesEvenly(), which
relies on the planner's expectedOutputFiles calculation
7. Updated documentation
- File: docs/docs/maintenance.md
- Added: New section documenting Parquet row-group level merging feature
with requirements and usage
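For item 4, the lookup-with-default behavior can be sketched as below. This is only an illustration, not the code in the PR: the class name MergeConfigSketch and both helper methods are hypothetical, and the Hadoop configuration is modeled as a plain Map to keep the example free of external dependencies. The property names and defaults are the ones listed above.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg or Parquet.
final class MergeConfigSketch {
  // Property names and defaults as listed in item 4 of the summary.
  static final String ROW_GROUP_SIZE = "write.parquet.row-group-size-bytes";
  static final long ROW_GROUP_SIZE_DEFAULT = 128L * 1024 * 1024; // 128 MB
  static final String COLUMN_INDEX_TRUNCATE_LENGTH = "parquet.columnindex.truncate.length";
  static final int COLUMN_INDEX_TRUNCATE_LENGTH_DEFAULT = 64;

  private MergeConfigSketch() {}

  // Returns the configured row group size in bytes, or the default when unset.
  static long rowGroupSize(Map<String, String> tableProperties) {
    String value = tableProperties.get(ROW_GROUP_SIZE);
    return value == null ? ROW_GROUP_SIZE_DEFAULT : Long.parseLong(value.trim());
  }

  // Assumption: the Hadoop configuration is represented here as a Map.
  static int columnIndexTruncateLength(Map<String, String> hadoopConf) {
    String value = hadoopConf.get(COLUMN_INDEX_TRUNCATE_LENGTH);
    return value == null
        ? COLUMN_INDEX_TRUNCATE_LENGTH_DEFAULT
        : Integer.parseInt(value.trim());
  }
}
```

With an empty map, both helpers fall back to the defaults stated above; a table that sets write.parquet.row-group-size-bytes gets its own value instead.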
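For item 6, one possible shape of an even distribution is sketched below. The summary only names distributeFilesEvenly(); the body here is an assumption, not the PR's implementation. It splits by file count into contiguous groups and ignores file sizes, on the premise that the planner has already sized expectedOutputFiles. The class name EvenDistributionSketch and the generic signature are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the code from SparkParquetFileMergeRunner.
final class EvenDistributionSketch {
  private EvenDistributionSketch() {}

  // Splits files into at most expectedOutputFiles contiguous groups whose
  // sizes differ by no more than one. Input order is preserved.
  static <T> List<List<T>> distributeEvenly(List<T> files, int expectedOutputFiles) {
    if (expectedOutputFiles < 1) {
      throw new IllegalArgumentException("expectedOutputFiles must be >= 1");
    }
    List<List<T>> groups = new ArrayList<>();
    if (files.isEmpty()) {
      return groups;
    }
    // Never create more groups than there are files.
    int groupCount = Math.min(expectedOutputFiles, files.size());
    int base = files.size() / groupCount;
    int remainder = files.size() % groupCount;
    int start = 0;
    for (int i = 0; i < groupCount; i++) {
      // The first `remainder` groups take one extra file.
      int size = base + (i < remainder ? 1 : 0);
      groups.add(new ArrayList<>(files.subList(start, start + size)));
      start += size;
    }
    return groups;
  }
}
```

For example, five files with expectedOutputFiles = 2 yield one group of three files and one group of two.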
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]