yyanyy commented on a change in pull request #2556:
URL: https://github.com/apache/iceberg/pull/2556#discussion_r628519164



##########
File path: api/src/main/java/org/apache/iceberg/types/TypeUtil.java
##########
@@ -213,7 +224,7 @@ public static Schema assignIncreasingFreshIds(Schema 
schema) {
    */
   public static Schema reassignIds(Schema schema, Schema idSourceSchema) {
     Types.StructType struct = visit(schema, new 
ReassignIds(idSourceSchema)).asStructType();
-    return new Schema(struct.fields());
+    return new Schema(struct.fields(), refreshIdentifierFields(struct, 
schema));

Review comment:
       since `schema` passed in to this method is almost always a schema just 
got constructed (and thus have to call this method to assign the right ids), I 
think `refreshIdentifierFields(struct, schema)` here will almost always be a 
no-op? 
   
   I wonder if we want to use `idSourceSchema` to get the identifier fields, 
although that might have a different problem that the input `schema` could be a 
subset of `idSourceSchema` and thus doesn't include all identifier fields. 
Though in this case identifier fields won't be useful and we may skip it, and 
we can visit two schema and verify they have the same number of columns to 
identify this case. I'm not sure about the use case of this method and if we 
really need to assign identifier fields here though. 

##########
File path: api/src/test/java/org/apache/iceberg/types/TestTypeUtil.java
##########
@@ -42,6 +44,44 @@ public void testReassignIdsDuplicateColumns() {
     Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
   }
 
+  @Test
+  public void testReassignIdsWithIdentifier() {
+    Schema schema = new Schema(
+        Lists.newArrayList(
+            required(0, "a", Types.IntegerType.get()),
+            required(1, "A", Types.IntegerType.get())),
+        Sets.newHashSet(0)
+    );
+    Schema sourceSchema = new Schema(
+        Lists.newArrayList(
+            required(1, "a", Types.IntegerType.get()),
+            required(2, "A", Types.IntegerType.get())),
+        Sets.newHashSet(1)
+    );
+    final Schema actualSchema = TypeUtil.reassignIds(schema, sourceSchema);
+    Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
+    Assert.assertEquals(sourceSchema.identifierFieldIds(), 
actualSchema.identifierFieldIds());
+  }
+
+  @Test
+  public void testAssignIncreasingFreshIdWithIdentifier() {
+    Schema schema = new Schema(
+        Lists.newArrayList(
+            required(10, "a", Types.IntegerType.get()),
+            required(11, "A", Types.IntegerType.get())),
+        Sets.newHashSet(10)
+    );
+    Schema sourceSchema = new Schema(
+        Lists.newArrayList(
+            required(1, "a", Types.IntegerType.get()),
+            required(2, "A", Types.IntegerType.get())),
+        Sets.newHashSet(1)
+    );
+    final Schema actualSchema = TypeUtil.assignIncreasingFreshIds(schema);
+    Assert.assertEquals(sourceSchema.asStruct(), actualSchema.asStruct());
+    Assert.assertEquals(sourceSchema.identifierFieldIds(), 
actualSchema.identifierFieldIds());
+  }
+

Review comment:
       do we want to add a case to call `assignFreshIds(Schema schema, Schema 
baseSchema, NextID nextId)` with `schema` doesn't have identifier fields but 
`baseSchema` has, and the output schema will have identifier fields? I think 
it's an existing use case we have in TableMetadata that may worth explicit 
testing. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to