lintingbin commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3834417714
Hi @shangxinli, great work on this PR! I'm from the Apache Amoro project and we're very interested in leveraging this optimization.

I have a question about the row-group merging behavior: when merging many small files where each file contains small row groups (e.g., < 1 MB per row group), the merged output file will still contain many small row groups, since `reader.appendTo(writer)` copies row groups as-is without combining them. As noted in [PARQUET-1115](https://issues.apache.org/jira/browse/PARQUET-1115):

> "When used to merge many small files, the resulting file will still contain small row groups and one loses most of the advantages of larger files."

This could hurt read performance (predicate pushdown efficiency, vectorized read benefits, etc.).

**Questions:**
1. Is there any plan to add a minimum row-group size threshold to determine eligibility for binary merge?
2. Or perhaps a hybrid mode that falls back to row-level rewrite when source row groups are below a certain size?
3. Should the caller be responsible for checking row-group sizes before calling `ParquetFileMerger.mergeFiles()`? (A rough sketch of such a check follows below.)
4. **For a two-phase approach**: could we first use a traditional row-level rewrite to merge small files into larger files (with properly sized row groups), and then use `ParquetFileMerger` to merge those larger files? Would this be a recommended pattern, or is there a more efficient way to handle this scenario?

Thanks for your great contribution!
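To make question 3 concrete, here is a minimal caller-side sketch built only on the standard parquet-mr footer API (`ParquetFileReader`, `BlockMetaData`). The 1 MB floor and the helper name are illustrative assumptions on my part, not anything defined in this PR:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class RowGroupSizeCheck {
  // Assumed threshold for illustration; a real value would likely be configurable.
  private static final long MIN_ROW_GROUP_BYTES = 1L << 20; // 1 MB

  /**
   * Returns true only if every row group in the file meets the size floor,
   * i.e. the file is a reasonable candidate for binary (row-group copy) merge.
   */
  static boolean eligibleForBinaryMerge(Path file, Configuration conf) throws IOException {
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
      List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
      for (BlockMetaData rowGroup : rowGroups) {
        if (rowGroup.getCompressedSize() < MIN_ROW_GROUP_BYTES) {
          // A small row group would be copied as-is by appendTo(), so fall
          // back to row-level rewrite for this file instead.
          return false;
        }
      }
      return true;
    }
  }
}
```

If something along these lines were the recommended pattern, callers could partition their input files into "binary merge" and "row-level rewrite" groups up front, which is roughly the hybrid mode described in question 2.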
