shangxinli commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3554150937
> This is typically not an option, and very hard to find continuous stretches of row_ids.

I agree there are many cases that cause discontinuity. However, for files written in consecutive commits, e.g. append-only datasets, the row IDs are continuous (File 1: rows 0-49, File 2: rows 50-99 in your example). When merging File 1 and File 2, the merged file can keep `first_row_id=0` and span rows 0-99. This is a common pattern: the initial write produces many small files because we increase parallelism, and right afterwards we merge them into larger files. A contiguity check for this case is sketched below.

> Can we do this without rewriting the row group? Do we still have gains compared to the "normal" read/write compaction?

Yes, we can, and the gains are still significant. Parquet's columnar format lets us copy most column chunks directly while rewriting only the `_row_id` column (see the second sketch below). Hudi's row-group rewriter does exactly this with the `_file_name` column: it rewrites that one column while copying all other columns, and still achieves ~10x better performance than a full read/write compaction.
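To make the first point concrete, here is a minimal sketch (the `DataFile` record and helper are hypothetical, not part of this PR) of the contiguity check a compaction planner could use to decide whether the merged file can reuse the first file's `first_row_id`:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical planner helper: decides whether a set of data files forms one
// contiguous row-id range, so the merged file can reuse the first file's
// first_row_id instead of forcing a _row_id rewrite.
public class RowIdContiguity {

  // Minimal stand-in for a data file's row-id metadata.
  record DataFile(long firstRowId, long recordCount) {}

  // Returns the merged first_row_id if the files are contiguous, or -1 if
  // there is a gap or overlap.
  static long mergedFirstRowId(List<DataFile> files) {
    List<DataFile> sorted = files.stream()
        .sorted(Comparator.comparingLong(DataFile::firstRowId))
        .toList();
    long next = sorted.get(0).firstRowId();
    for (DataFile f : sorted) {
      if (f.firstRowId() != next) {
        return -1; // gap or overlap: row IDs cannot be preserved as-is
      }
      next += f.recordCount();
    }
    return sorted.get(0).firstRowId();
  }

  public static void main(String[] args) {
    // File 1: rows 0-49, File 2: rows 50-99 -> merged first_row_id = 0
    List<DataFile> files = List.of(new DataFile(0, 50), new DataFile(50, 50));
    System.out.println(mergedFirstRowId(files)); // prints 0
  }
}
```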
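And for the second point, a structural sketch of the single-column rewrite. All types and method names below are stand-ins defined locally, not a real Parquet API; parquet-mr's `ParquetRewriter` implements this same pattern (copying untouched column chunks byte-for-byte while re-encoding selected columns) for column masking:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical outline: copy every column chunk as-is except _row_id, which
// is re-encoded with the merged file's new base row ID.
public class RowIdRewriteSketch {

  interface ColumnChunk { String columnName(); long rowCount(); }
  interface RowGroupReader { List<ColumnChunk> columns(); }
  interface RowGroupWriter {
    void copyChunk(ColumnChunk chunk) throws IOException;               // raw byte copy, no decode
    void writeRowIds(long firstRowId, long rowCount) throws IOException; // re-encode one column
  }

  // Rewrites one row group; returns the first_row_id for the next one.
  static long rewriteRowGroup(RowGroupReader in, RowGroupWriter out, long firstRowId)
      throws IOException {
    long rows = 0;
    for (ColumnChunk chunk : in.columns()) {
      if ("_row_id".equals(chunk.columnName())) {
        out.writeRowIds(firstRowId, chunk.rowCount()); // only column re-encoded
        rows = chunk.rowCount();
      } else {
        out.copyChunk(chunk); // compressed pages copied directly: no decode/re-encode cost
      }
    }
    return firstRowId + rows; // next row group continues the sequence
  }
}
```

The key design property is that the expensive decompress/decode/re-encode path runs for exactly one narrow column, which is where the large speedup over full read/write compaction comes from.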
