stevenzwu commented on code in PR #16699:
URL: https://github.com/apache/iceberg/pull/16699#discussion_r3368054838


##########
spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkContentFile.java:
##########
@@ -55,6 +55,7 @@ public abstract class SparkContentFile<F> implements 
ContentFile<F> {
   private final int referencedDataFilePosition;
   private final int contentOffsetPosition;
   private final int contentSizePosition;
+  private final int firstRowIdPosition;

Review Comment:
   Nit: the `FIRST_ROW_ID` schema field (id 142) sits before 
`REFERENCED_DATA_FILE`/`CONTENT_OFFSET`/`CONTENT_SIZE` (143–145), and the rest 
of the constructor mirrors that order. Suggest moving this declaration above 
`referencedDataFilePosition` (and the assignment on line 111) for consistency.



##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteManifestsAction.java:
##########
@@ -196,6 +196,145 @@ public void testRewriteManifestsPreservesOptionalFields() 
throws IOException {
     }
   }
 
+  @TestTemplate
+  public void testRewriteV3ManifestsPreservesFirstRowId() {
+    assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
+
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, 
snapshotIdInheritanceEnabled);
+    Table table = TABLES.create(SCHEMA, spec, options, tableLocation);
+
+    writeRecords(Lists.newArrayList(new ThreeColumnRecord(1, null, "AAAA")));
+    writeRecords(Lists.newArrayList(new ThreeColumnRecord(2, "CCCC", "CCCC")));
+    table.refresh();
+
+    assertThat(table.currentSnapshot().dataManifests(table.io())).hasSize(2);
+
+    List<Row> rowsBefore =
+        spark
+            .read()
+            .format("iceberg")
+            .load(tableLocation)
+            .selectExpr("_row_id", "_last_updated_sequence_number", "*")
+            .orderBy("_row_id")
+            .collectAsList();
+
+    SparkActions.get()
+        .rewriteManifests(table)
+        .rewriteIf(manifest -> true)
+        .option(RewriteManifestsSparkAction.USE_CACHING, useCaching)
+        .execute();
+
+    assertThat(table.currentSnapshot().dataManifests(table.io())).hasSize(1);
+
+    List<Row> rowsAfter =
+        spark
+            .read()
+            .format("iceberg")
+            .load(tableLocation)
+            .selectExpr("_row_id", "_last_updated_sequence_number", "*")
+            .orderBy("_row_id")
+            .collectAsList();
+
+    assertThat(rowsAfter).containsExactlyElementsOf(rowsBefore);

Review Comment:
   `containsExactlyElementsOf` is null-safe — if `_row_id` resolves to `null` 
for both `rowsBefore` and `rowsAfter` (e.g., a future regression in 
metadata-column resolution), the equality would silently pass and hide the 
regression. `testRewriteManifestsAfterV2ToV3Upgrade` already guards against 
this with `.extracting(_row_id).doesNotContainNull()`. Suggest adding the same 
check on `rowsBefore` here and at line 286 so the assertion is self-defending.



##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteManifestsAction.java:
##########
@@ -196,6 +196,145 @@ public void testRewriteManifestsPreservesOptionalFields() 
throws IOException {
     }
   }
 
+  @TestTemplate
+  public void testRewriteV3ManifestsPreservesFirstRowId() {
+    assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
+
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, 
snapshotIdInheritanceEnabled);
+    Table table = TABLES.create(SCHEMA, spec, options, tableLocation);
+
+    writeRecords(Lists.newArrayList(new ThreeColumnRecord(1, null, "AAAA")));
+    writeRecords(Lists.newArrayList(new ThreeColumnRecord(2, "CCCC", "CCCC")));
+    table.refresh();
+
+    assertThat(table.currentSnapshot().dataManifests(table.io())).hasSize(2);
+
+    List<Row> rowsBefore =
+        spark
+            .read()
+            .format("iceberg")
+            .load(tableLocation)
+            .selectExpr("_row_id", "_last_updated_sequence_number", "*")
+            .orderBy("_row_id")
+            .collectAsList();
+
+    SparkActions.get()
+        .rewriteManifests(table)
+        .rewriteIf(manifest -> true)
+        .option(RewriteManifestsSparkAction.USE_CACHING, useCaching)
+        .execute();
+
+    assertThat(table.currentSnapshot().dataManifests(table.io())).hasSize(1);
+
+    List<Row> rowsAfter =
+        spark
+            .read()
+            .format("iceberg")
+            .load(tableLocation)
+            .selectExpr("_row_id", "_last_updated_sequence_number", "*")
+            .orderBy("_row_id")
+            .collectAsList();
+
+    assertThat(rowsAfter).containsExactlyElementsOf(rowsBefore);
+  }
+
+  @TestTemplate
+  public void testRewriteV3PartitionedManifestsPreservesFirstRowId() {

Review Comment:
   Optional: this test differs from `testRewriteV3ManifestsPreservesFirstRowId` 
only by `PartitionSpec`. Could fold into a single helper that takes the spec. 
Existing tests in this file already split partitioned/unpartitioned variants, 
so leaving as-is is also fine.



##########
spark/v4.1/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteManifestsAction.java:
##########
@@ -196,6 +196,145 @@ public void testRewriteManifestsPreservesOptionalFields() 
throws IOException {
     }
   }
 
+  @TestTemplate
+  public void testRewriteV3ManifestsPreservesFirstRowId() {
+    assumeThat(formatVersion).isGreaterThanOrEqualTo(3);
+
+    PartitionSpec spec = PartitionSpec.unpartitioned();
+    Map<String, String> options = Maps.newHashMap();
+    options.put(TableProperties.FORMAT_VERSION, String.valueOf(formatVersion));
+    options.put(TableProperties.SNAPSHOT_ID_INHERITANCE_ENABLED, 
snapshotIdInheritanceEnabled);

Review Comment:
   Nit: `SNAPSHOT_ID_INHERITANCE_ENABLED` is a v1-only knob, but these new 
tests assert `formatVersion >= 3`, so the option is a no-op here. Same on lines 
251 and 296 — can be dropped from all three new tests.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to