shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3554150937

   > This is typically not an option, and very hard to find continuous 
   stretches of row_ids.
   
   I agree there are many cases that cause discontinuity. However, for files 
   written in consecutive commits, e.g. in append-only datasets, the row IDs 
   are continuous (File 1: rows 0-49, File 2: rows 50-99 in your example). 
   When merging File 1 and File 2, the merged file can keep first_row_id=0 
   and span rows 0-99. This is a common pattern: the initial write produces 
   many small files because parallelism is high, and right after that we 
   compact them into larger files. A sketch of the contiguity check follows 
   below.
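
   A minimal sketch of that planner check, assuming a hypothetical FileInfo 
   record stands in for the per-file metadata the compaction job already has 
   (first_row_id and record count from the manifest entry); this is 
   illustrative, not Iceberg API:
   
   ```java
   import java.util.Comparator;
   import java.util.List;
   
   public class RowIdContinuity {
     // Hypothetical stand-in for a data file's row-lineage metadata.
     record FileInfo(long firstRowId, long recordCount) {}
   
     /** True iff the files form one unbroken run of row IDs. */
     static boolean hasContinuousRowIds(List<FileInfo> files) {
       List<FileInfo> sorted = files.stream()
           .sorted(Comparator.comparingLong(FileInfo::firstRowId))
           .toList();
       for (int i = 1; i < sorted.size(); i++) {
         FileInfo prev = sorted.get(i - 1);
         // Contiguous iff each file starts exactly where the previous ends.
         if (prev.firstRowId() + prev.recordCount() != sorted.get(i).firstRowId()) {
           return false;
         }
       }
       return true;
     }
   
     public static void main(String[] args) {
       // File 1 = rows 0-49, File 2 = rows 50-99: contiguous, so the merged
       // file can keep first_row_id = 0 and span rows 0-99.
       System.out.println(hasContinuousRowIds(
           List.of(new FileInfo(0, 50), new FileInfo(50, 50))));  // true
     }
   }
   ```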
   
   > Can we do this without rewriting the rowgroup? Do we still have gains 
   compared to the "normal" read/write compaction?
   
   Yes, we can, and the gains are still significant. Parquet's columnar 
   layout lets us copy most column chunks as raw bytes while rewriting only 
   the _row_id column. Hudi's row-group rewriter does exactly this with its 
   _file_name column: it rewrites that one column while copying all the 
   others, and still achieves roughly a 10x speedup over full read/write 
   compaction.
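
   To make the shape of that rewrite concrete, here is a pseudocode-level 
   sketch; readRowGroups, copyChunkBytes, and writeRowIdColumn are 
   hypothetical helpers (not parquet-mr or Iceberg APIs). The point is the 
   per-chunk branch: every column except _row_id is copied as compressed 
   bytes and is never decoded.
   
   ```java
   // Assumed first_row_id of the merged file (0 in the example above).
   long nextRowId = mergedFirstRowId;
   for (InputFile source : sourceFiles) {
     for (RowGroup rowGroup : readRowGroups(source)) {
       writer.startRowGroup(rowGroup.rowCount());
       for (ColumnChunk chunk : rowGroup.columns()) {
         if (chunk.path().equals("_row_id")) {
           // Only this column is re-encoded, with fresh continuous values.
           writeRowIdColumn(writer, nextRowId, rowGroup.rowCount());
         } else {
           // Every other chunk is a raw byte copy: no decompression,
           // no decoding, no re-encoding.
           copyChunkBytes(writer, source, chunk);
         }
       }
       writer.endRowGroup();
       nextRowId += rowGroup.rowCount();
     }
   }
   ```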

