[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

GitBox Tue, 13 Apr 2021 11:19:14 -0700


yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016




##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
       if (pretty) {
         generator.useDefaultPrettyPrinter();
       }
-      toJson(schema.asStruct(), generator);
+      toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(), 
generator);

Review comment:
       Haven't reviewed the entire PR yet, just for this specific comment - the 
reason was mentioned in 
https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I 
think there are a lot of places where this method is called for 
serialization/deserialization of the schema as well as preserve the schema to 
some file metadata, and at that point sometimes `schema` object is already 
transformed/projected so it is no longer an existing schema of the table, and 
thus shouldn't have schema ID. Since schema ID is in theory only useful when 
retrieved directly from table metadata to get the right schema version for time 
travel, I think it's better to not explicitly write a schema Id except for 
writing to table metadata file, to avoid persisting an incorrect version. 
   
   For the case of supporting row identifier to schema, I wonder if we want to 
do something similar - since row identifier is mostly useful during writing (to 
check table to get the default row identifier for delete) and not to read 
(since data/delete file should have this preserved in another place), we might 
want to have a separate `toJson`/`fromJson` to handle the row identifier 
parsing separately, and normally when handling schema we don't include this 
information; i.e. to make the internal model of rowId similar to what we have 
in #2354, except that serialization/deserialization is different. 
   
   @rdblue since you involved in both conversations, could you please take a 
look? Thank you!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] yyanyy commented on a change in pull request #2465: Core: add row identifier to schema

Reply via email to