pvary commented on PR #14435: URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3540594805
I agree with @Guosmilesmile that for V3 tables, row lineage tracking is broken because the original files don’t contain row_id; they inherit them later.

Consider this scenario:
- Commit 1 adds File 1 (50 rows), sets first_row_id = 0, but doesn’t assign row_ids. The row_ids are 0, 1, 2, 3, ..., 49.
- Commit 2 adds File 2 (50 rows), sets first_row_id = 50, but doesn’t assign row_ids. The row_ids are 50, 51, 52, 53, ..., 99.
- Commit 3 adds File 3 (50 rows), sets first_row_id = 100, but doesn’t assign row_ids. The row_ids are 100, 101, 102, 103, ..., 149.
- Commit 4 performs compaction, merging File 1 and File 2. The commit sets first_row_id = 150, and since the new file doesn’t contain row_ids, they are assigned by the current algorithm starting from 150. The row_ids are 150, 151, 152, 153, ..., 249.

As a result, the compacted rows receive new row_ids, which breaks lineage tracking.
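
To make the scenario concrete, here is a minimal sketch of the row_id inheritance described above. It is a toy model, not Iceberg code: `Table`, `DataFile`, `commit_add`, `compact`, and the `preserve` flag are hypothetical names, and the model assumes the table-level counter advances by the record count of every added file. With `preserve=False` the compacted file inherits fresh row_ids starting at 150; with `preserve=True` (assuming the rewrite copies the existing row_ids into the new file) the original ids survive.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataFile:
    name: str
    record_count: int
    first_row_id: int
    # Row ids physically written in the file; None means "inherit from first_row_id".
    stored_row_ids: Optional[List[int]] = None


class Table:
    """Toy model of a V3 table; all names here are hypothetical, not the Iceberg API."""

    def __init__(self) -> None:
        self.next_row_id = 0
        self.files: Dict[str, DataFile] = {}

    def commit_add(self, name: str, record_count: int,
                   stored_row_ids: Optional[List[int]] = None) -> DataFile:
        # Assumption: each commit hands the new file the current counter value
        # and advances the counter by the file's record count.
        data_file = DataFile(name, record_count, self.next_row_id, stored_row_ids)
        self.next_row_id += record_count
        self.files[name] = data_file
        return data_file

    def row_ids(self, name: str) -> List[int]:
        data_file = self.files[name]
        if data_file.stored_row_ids is not None:
            return list(data_file.stored_row_ids)
        # No stored ids: inherit first_row_id + position within the file.
        start = data_file.first_row_id
        return list(range(start, start + data_file.record_count))

    def compact(self, sources: List[str], target: str, preserve: bool) -> DataFile:
        # Read the effective row ids of the source files before removing them.
        merged = [rid for src in sources for rid in self.row_ids(src)]
        for src in sources:
            del self.files[src]
        return self.commit_add(target, len(merged), merged if preserve else None)


def simulate(preserve: bool) -> List[int]:
    table = Table()
    table.commit_add("File 1", 50)   # Commit 1: first_row_id = 0
    table.commit_add("File 2", 50)   # Commit 2: first_row_id = 50
    table.commit_add("File 3", 50)   # Commit 3: first_row_id = 100
    # Commit 4: compaction of File 1 and File 2, first_row_id = 150
    table.compact(["File 1", "File 2"], "File 4", preserve=preserve)
    return table.row_ids("File 4")
```

Calling `simulate(preserve=False)` reproduces the problem from the scenario, while `simulate(preserve=True)` shows what lineage-preserving compaction would look like under the same assumptions.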
