albertmorenosugr commented on issue #14851: URL: https://github.com/apache/iceberg/issues/14851#issuecomment-3673508295
Hi @willjanning. The behavior, observed with Spark 3.5 in my case, is explained by the interaction between Iceberg's copy-on-write (CoW) mechanism and changelog generation, rather than by changelog computation alone.

In your reproduced scenario, when the table is initialized with `INSERT INTO`, Spark executes multiple parallel write tasks and produces multiple immutable Parquet data files (**one per task**). During a subsequent `MERGE INTO`, Spark runs a scan-and-match phase to identify the affected data files based on the `ON` condition. Only the data file containing the matching row is rewritten; the other data files remain referenced from the previous snapshot, so rows in unaffected files correctly retain their original `change_ordinal`.

In contrast, when the table is initialized with `MERGE INTO`, the initial write in Spark 3.5 may materialize several inserted rows into a single data file (depending on the physical plan). A subsequent `MERGE INTO` marks that file as affected if it contains at least one matching row. Because Parquet files are immutable, CoW forces Spark to rewrite the entire file, so logically unchanged (no-op) rows that happen to be co-located in that file are physically rewritten and included in the new snapshot.

The key point is that appearing in a new snapshot due to a file-level rewrite is expected and correct. The problem is that the current changelog semantics conflate physical re-materialization with logical row-level changes: rows rewritten solely because of file-level boundaries are assigned a new `change_ordinal`, even though they did not change logically.

This is neither an execution bug in Spark 3.5 nor incorrect behavior in Iceberg's file-rewrite logic. It highlights a limitation: changelog generation is tightly coupled to physical rewrites instead of being driven by logical row-level change semantics.

Spark 4.x improves this by expressing `MERGE INTO` as a row-level operation in DataSource V2 and carrying per-row change information (INSERT/UPDATE/DELETE/NOOP) further into the execution plan. File-level rewrites remain unavoidable for CoW, but this separation lets changelog computation be driven by row-level change actions rather than by physical re-materialization alone, preventing logically unchanged rows from being reported as changed.
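For anyone who wants to observe the file-layout difference directly, here is a minimal sketch of the two initialization paths (the `demo` catalog and `db.tbl_a` / `db.tbl_b` names are hypothetical, not from the original report). Iceberg's `files` metadata table shows how many data files each path produced:

```sql
-- Path A: plain INSERT INTO; parallel write tasks typically yield one Parquet file per task.
CREATE TABLE demo.db.tbl_a (id INT, data STRING) USING iceberg;
INSERT INTO demo.db.tbl_a VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- Path B: initializing via MERGE INTO; on Spark 3.5 the physical plan may funnel
-- all inserted rows into a single data file.
CREATE TABLE demo.db.tbl_b (id INT, data STRING) USING iceberg;
MERGE INTO demo.db.tbl_b t
USING (SELECT * FROM VALUES (1, 'a'), (2, 'b'), (3, 'c') AS v(id, data)) s
ON t.id = s.id
WHEN NOT MATCHED THEN INSERT *;

-- Compare the number of data files backing each table.
SELECT file_path, record_count FROM demo.db.tbl_a.files;
SELECT file_path, record_count FROM demo.db.tbl_b.files;
```

If `tbl_b` reports a single data file with all three records while `tbl_a` reports several, any later `MERGE INTO` that matches one row of `tbl_b` must rewrite all three.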
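And a sketch of how the effect can surface in the changelog, using the `create_changelog_view` Spark procedure from the Iceberg docs. Names are again hypothetical, and whether carried-over rows are filtered out depends on the Iceberg version and the options passed, so treat this as illustrative rather than a definitive reproduction:

```sql
-- A MERGE that logically updates only one row (id = 1).
MERGE INTO demo.db.tbl_b t
USING (SELECT 1 AS id, 'a2' AS data) s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.data = s.data;

-- Build a changelog view over the table's history.
CALL demo.system.create_changelog_view(
  table => 'db.tbl_b',
  changelog_view => 'tbl_b_changes'
);

-- With tbl_b backed by a single data file, the co-located rows (id = 2, 3) may
-- appear here with a new _change_ordinal despite being logically unchanged;
-- with the INSERT INTO layout, only the matched row's file is rewritten.
SELECT id, data, _change_type, _change_ordinal, _commit_snapshot_id
FROM tbl_b_changes
ORDER BY _change_ordinal, id;
```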
