[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

GitBox Tue, 13 Apr 2021 11:35:49 -0700


yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016




##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), 
generator);

Review comment:
       Haven't reviewed the entire PR yet, just for this specific comment - the 
reason was mentioned in 
https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I 
think there are a lot of places where this method is called for 
serialization/deserialization of the schema as well as preserve the schema to 
some file metadata, and at that point sometimes `schema` object is already 
transformed/projected so it is no longer an existing schema of the table, and 
thus shouldn't have schema ID. Since schema ID is in theory only useful when 
retrieved directly from table metadata to get the right schema version for time 
travel, I think it's better to not explicitly write a schema Id except for 
writing to table metadata file, to avoid persisting an incorrect version. 
   
   For the case of supporting row identifier to schema, I wonder if we want to 
do something similar - since row identifier is mostly useful during writing (to 
check table to get the default row identifier for delete) and not reading 
(since data/delete file should have this preserved in another place) or any 
kind of projection (especially that with the current change during projection 
we will lose row identifier information), to ensure that we don't involve in 
the complications when working with a mutated schema object I wonder if we want 
to have a separate `toJson`/`fromJson` to handle the row identifier parsing 
just for table metadata, and once we translate table metadata file to its 
model, we read row id from table instead of schema with something like 
`table.getRowIdentifier` which may help with table 
serialization/deserialization. But this approach seems to contradict with some 
aspects in https://github.com/apache/iceberg/pull/2354#issuecomment-816337252. 
   
   @rdblue since you involved in both conversations, could you please take a 
look? Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Reply via email to