albertmorenosugr commented on issue #14851:
URL: https://github.com/apache/iceberg/issues/14851#issuecomment-3673508295

   Hi @willjanning 
   
   The behavior, which I observed in Spark 3.5, can be explained by the 
interaction between Iceberg's copy-on-write (CoW) mechanism and changelog 
generation, rather than by the changelog computation logic alone.
   
   In your reproduced scenario, when the table is initialized using INSERT 
INTO, Spark executes multiple parallel write tasks, producing multiple 
immutable Parquet data files (**one per task**). During a subsequent MERGE 
INTO, Spark performs a scan and matching phase to identify the affected data 
files based on the ON condition. Only the data file containing the matching row 
is rewritten, while other data files remain referenced from the previous 
snapshot. As a result, rows in unaffected files correctly retain their original 
change_ordinal.
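
   For reference, here is a minimal Spark SQL sketch of that first scenario. 
The catalog and table names (spark_catalog, db.events), the sample rows, and 
the column layout are placeholders; the changelog is inspected through 
Iceberg's create_changelog_view procedure.

   ```sql
   -- Minimal sketch of the INSERT INTO initialization path (placeholder names).
   CREATE TABLE db.events (id BIGINT, payload STRING) USING iceberg;

   -- With enough data, parallel write tasks produce one data file per task;
   -- this small VALUES list only illustrates the shape of the scenario.
   INSERT INTO db.events VALUES (1, 'a'), (2, 'b'), (3, 'c');

   -- Only the data file containing id = 2 should be rewritten by this MERGE.
   MERGE INTO db.events t
   USING (SELECT 2 AS id, 'b-updated' AS payload) s
   ON t.id = s.id
   WHEN MATCHED THEN UPDATE SET t.payload = s.payload;

   -- Create a changelog view (named <table>_changes by default) and check
   -- which rows received a new _change_ordinal.
   CALL spark_catalog.system.create_changelog_view(table => 'db.events');
   SELECT id, payload, _change_type, _change_ordinal
   FROM events_changes
   ORDER BY _change_ordinal, id;
   ```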
   
   In contrast, when the table is initialized using MERGE INTO, the initial 
write in Spark 3.5 may materialize several inserted rows into a single data 
file (depending on the physical plan). In a subsequent MERGE INTO, that file is 
identified as affected if it contains at least one matching row. Because Parquet 
files are immutable, the copy-on-write mechanism forces Spark to rewrite the 
entire data file during the rewrite phase. This causes logically unchanged 
(NO-OP) rows that are co-located in the same file to be physically rewritten 
and included in the new snapshot.
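
   One way to confirm this file-level behavior is to look at Iceberg's metadata 
tables (again with the placeholder table name from the sketch above): the files 
table shows how many rows each data file holds, and the snapshots table shows 
the later MERGE producing a full-file rewrite even though only one row matched 
the ON condition.

   ```sql
   -- Data file layout of the current snapshot (run this before the second MERGE,
   -- or use time travel, to see how the initial write laid out the rows).
   SELECT file_path, record_count
   FROM db.events.files;

   -- Snapshot history: with a MERGE-initialized table, the later MERGE rewrites
   -- the whole data file even when a single row logically changed.
   SELECT committed_at, snapshot_id, operation, summary
   FROM db.events.snapshots
   ORDER BY committed_at;
   ```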
   
   The key point is that it is expected and correct for rows to reappear in a 
new snapshot because of a file-level rewrite. However, the current changelog 
semantics 
conflate physical re-materialization with logical row-level changes: rows that 
are rewritten solely due to file-level boundaries are assigned a new 
change_ordinal, even though they did not change logically.
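
   As a rough workaround sketch under the setup above: in my understanding, a 
row that is rewritten only because of file boundaries appears in the changelog 
as a DELETE and an INSERT with identical column values in the same 
_change_ordinal (a carry-over pair), assuming the view reports plain 
INSERT/DELETE rows. Cancelling those pairs approximates a logical row-level 
changelog.

   ```sql
   -- Workaround sketch: cancel DELETE/INSERT pairs with identical values within
   -- the same ordinal, keeping only rows with a net logical change.
   WITH net AS (
     SELECT id, payload, _change_ordinal,
            SUM(CASE WHEN _change_type = 'INSERT' THEN 1 ELSE -1 END) AS net_count
     FROM events_changes
     GROUP BY id, payload, _change_ordinal
   )
   SELECT id, payload, _change_ordinal, net_count
   FROM net
   WHERE net_count <> 0;
   ```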
   
   This is not an execution bug in Spark 3.5, nor incorrect behavior in 
Iceberg’s file rewrite logic. Rather, it highlights a limitation where 
changelog generation is tightly coupled to physical rewrites instead of being 
driven by logical row-level change semantics.
   
   Spark 4.x improves this situation by expressing MERGE INTO as a row-level 
operation in DataSource V2 and preserving per-row change information (e.g. 
INSERT/UPDATE/DELETE/NOOP) further into the execution plan. While file-level 
rewrites remain unavoidable for CoW, this separation allows changelog 
computation to be based on row-level change actions rather than on physical 
re-materialization alone, preventing logically unchanged rows from being 
reported as changed.

