albertmorenosugr commented on issue #14851:
URL: https://github.com/apache/iceberg/issues/14851#issuecomment-3673006209

   Hi @willjanning 
   
In Spark 3.5, the observed behavior is caused by how MERGE INTO materializes 
its output at the data-file level, not by the changelog computation itself.
   
   When a table is initialized with INSERT INTO, Spark plans the VALUES input 
as a partitioned local relation and executes **multiple parallel write tasks, 
each producing an immutable Parquet data file**. As a result, rows are 
distributed across multiple data files. A subsequent MERGE INTO first performs 
a matching phase to identify affected data files, and only the data file 
containing the matched row is rewritten. Rows stored in other data files remain 
referenced from the previous snapshot and therefore correctly retain their 
original change_ordinal.
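   This file-level copy-on-write behavior can be sketched in plain Python (a simplified model, not Spark or Iceberg APIs; the `merge_rewrite` helper, the one-row-per-file layout, and the ordinal bookkeeping are assumptions for illustration only):
   
   ```python
   # Sketch: data files are immutable lists of rows. A merge rewrites only
   # the files that contain a matched row; rows in untouched files stay in
   # the previous snapshot and keep their original ordinal.
   
   def merge_rewrite(files, matched_ids, new_ordinal):
       """Rewrite only the data files that contain a matched row."""
       result = []
       for rows in files:
           if any(r["id"] in matched_ids for r in rows):
               # Whole file rewritten: every row in it gets the new ordinal.
               rows = [dict(r, ordinal=new_ordinal) for r in rows]
           result.append(rows)
       return result
   
   # INSERT INTO case: parallel write tasks put each row in its own file.
   files = [[{"id": 1, "ordinal": 0}], [{"id": 2, "ordinal": 0}]]
   files = merge_rewrite(files, matched_ids={1}, new_ordinal=1)
   print({r["id"]: r["ordinal"] for rows in files for r in rows})
   # {1: 1, 2: 0} -- row 2's file was never rewritten
   ```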
   
   When a table is initialized with MERGE INTO, Spark 3.5 produces the merge 
result as a single logical output dataset that is written into a **single 
Parquet data file**, even though the input is logically partitioned. 
Consequently, all inserted rows are co-located in the same data file. During a 
subsequent MERGE INTO, the matching phase correctly identifies the data file 
containing the matched row; however, that same file also contains logically 
unchanged rows. Because Parquet files are immutable, Spark must rewrite the 
entire data file during the merge rewrite phase. This forces logically NO-OP 
rows to be physically rewritten and included in the new snapshot, which 
correctly results in those rows being assigned change_ordinal = 1.
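   By contrast, when every row is co-located in one file, the same copy-on-write rule drags logically unchanged rows into the new snapshot. A minimal sketch (plain Python, not the actual engine; the in-memory row layout is an assumption):
   
   ```python
   # MERGE INTO initialization case: all rows land in a single immutable file.
   file0 = [{"id": 1, "ordinal": 0}, {"id": 2, "ordinal": 0}]
   matched = {1}
   
   # The file contains a matched row, so the whole file must be rewritten;
   # the NO-OP row (id=2) is carried into the new snapshot as well.
   if any(r["id"] in matched for r in file0):
       file0 = [dict(r, ordinal=1) for r in file0]
   
   print({r["id"]: r["ordinal"] for r in file0})
   # {1: 1, 2: 1} -- the unchanged row also picked up the new ordinal
   ```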
   
   Therefore, the issue is a Spark 3.5 MERGE INTO execution artifact: coarse 
data-file output from the initial merge causes unnecessary rewrites in later 
merges. Iceberg behaves correctly given the physical rewrite boundaries imposed 
by Spark.
   
   In Spark 4.x, MERGE INTO is expressed as a true row-level operation under 
DataSource V2, with a clearer separation between row-level change detection and 
physical file materialization. The merge output is no longer forced into a 
single data file, allowing rewrites to be isolated to affected rows’ data files 
and preventing unnecessary rewrites of logically unchanged rows. As a result, 
change ordinals are preserved correctly.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

