Re: [PR] Add ParquetFileMerger for efficient row-group level file merging [iceberg]

via GitHub Sat, 15 Nov 2025 17:12:39 -0800


shangxinli commented on PR #14435:
URL: https://github.com/apache/iceberg/pull/14435#issuecomment-3537257510


   > Thanks for the PR! I have a question about the lineage: if the merging is 
only performed at the Parquet layer, will the lineage information of the v3 
table be disrupted?
   
   Good question! The lineage information for v3 tables is preserved in two 
ways:
   
     1. Field IDs (Schema Lineage)
   
     Field IDs are preserved because we strictly enforce identical schemas 
across all files being merged.
   
     In ParquetFileMerger.java:130-136, we validate that all input files have 
exactly the same Parquet MessageType schema:
   
     if (!schema.equals(currentSchema)) {
         throw new IllegalArgumentException(
             String.format(
                 "Schema mismatch detected: file '%s' has schema %s but file '%s' has schema %s. "
                     + "All files must have identical Parquet schemas for row-group level merging.",
                 ...));
     }
   
     Field IDs are stored directly in the Parquet schema structure itself (via 
Type.getId()), so when we copy row groups using ParquetFileWriter.appendFile() 
with the validated schema, all field IDs are preserved.
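
     To illustrate why an equality check is enough to protect field IDs, here is a minimal, self-contained sketch. It does not use the Parquet API: `Field`, `FileSchema`, and `requireIdenticalSchemas` are hypothetical stand-ins, and the sketch assumes that schema equality compares the field ID together with the name and type.

```java
import java.util.List;

// Hypothetical, simplified model of the check; NOT the Parquet MessageType API.
final class SchemaCheckSketch {
  // A field carries its field ID next to its name and type, so record
  // equality compares all three (assumption made for this sketch).
  record Field(int id, String name, String type) {}

  // A file path paired with the schema read from that file's footer.
  record FileSchema(String path, List<Field> fields) {}

  // Returns the shared schema, or throws if any file differs from the first.
  static List<Field> requireIdenticalSchemas(List<FileSchema> files) {
    if (files.isEmpty()) {
      throw new IllegalArgumentException("No input files to merge");
    }
    FileSchema first = files.get(0);
    for (FileSchema current : files.subList(1, files.size())) {
      if (!first.fields().equals(current.fields())) {
        throw new IllegalArgumentException(
            String.format(
                "Schema mismatch detected: file '%s' has schema %s but file '%s' has schema %s.",
                first.path(), first.fields(), current.path(), current.fields()));
      }
    }
    return first.fields();
  }
}
```

     In this model, two files whose columns match by name and type but differ in a single field ID are rejected, so any merge that passes the check writes the same IDs that every input already carried.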
   
     2. Row IDs (Row Lineage for v3+)
   
     Row IDs are automatically assigned by Iceberg's commit framework - we 
don't need special handling in the merger.
   
     Here's how it works:
   
     1. Our code creates DataFile objects with metrics (including recordCount) 
but without firstRowId - see SparkParquetFileMergeRunner.java:236-243
     2. During commit, SnapshotProducer creates a ManifestListWriter 
initialized with base.nextRowId() (the table's current row ID counter) - see 
SnapshotProducer.java:273
     3. ManifestListWriter.prepare() automatically assigns firstRowId to each 
manifest and increments the counter by the number of rows - see 
ManifestListWriter.java:136-140:
     // assign first-row-id and update the next to assign
     wrapper.wrap(manifest, nextRowId);
     this.nextRowId += manifest.existingRowsCount() + manifest.addedRowsCount();
     4. The snapshot is committed with the updated nextRowId, ensuring all row 
IDs are correctly tracked
   
     This is the same mechanism used by all Iceberg write operations, so row 
lineage is fully preserved for v3 tables.
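
     The counter arithmetic in step 3 can be sketched in isolation. This is a simplified model, not Iceberg's ManifestListWriter: the `Manifest` and `AssignedManifest` records and the `prepare` method here are hypothetical, and only the increment mirrors the snippet quoted above.

```java
// Hypothetical, simplified model of first-row-id assignment; NOT Iceberg's API.
final class RowIdAssignmentSketch {
  // Row counts a manifest reports for its existing and newly added data files.
  record Manifest(String path, long existingRowsCount, long addedRowsCount) {}

  // A manifest paired with the first row ID assigned to it at commit time.
  record AssignedManifest(Manifest manifest, long firstRowId) {}

  private long nextRowId;

  // startingRowId plays the role of the table's current next-row-id counter.
  RowIdAssignmentSketch(long startingRowId) {
    this.nextRowId = startingRowId;
  }

  // Assign the current counter as first-row-id, then advance the counter
  // by the number of rows the manifest covers.
  AssignedManifest prepare(Manifest manifest) {
    AssignedManifest assigned = new AssignedManifest(manifest, nextRowId);
    nextRowId += manifest.existingRowsCount() + manifest.addedRowsCount();
    return assigned;
  }

  // The value that would be committed as the table's new next-row-id.
  long nextRowId() {
    return nextRowId;
  }
}
```

     For example, starting from a counter of 1000, a manifest with 250 added rows receives first-row-id 1000 and moves the counter to 1250. The merged files need no row ID of their own at write time, because the range is reserved when the manifest list is written.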


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

