yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016
##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
if (pretty) {
generator.useDefaultPrettyPrinter();
}
- toJson(schema.asStruct(), generator);
+ toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(),
generator);
Review comment:
Haven't reviewed the entire PR yet, just for this specific comment - the
reason was mentioned in
https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I
think there are a lot of places where this method is called for
serialization/deserialization of the schema as well as preserve the schema to
some file metadata, and at that point sometimes `schema` object is already
transformed/projected so it is no longer an existing schema of the table, and
thus shouldn't have schema ID. Since schema ID is in theory only useful when
retrieved directly from table metadata to get the right schema version for time
travel, I think it's better to not explicitly write a schema Id except for
writing to table metadata file, to avoid persisting an incorrect version.
For the case of supporting row identifier to schema, I wonder if we want to
do something similar - since row identifier is mostly useful during writing (to
check table to get the default row identifier for delete) and not to read
(since data/delete file should have this preserved in another place), we might
want to have a separate `toJson`/`fromJson` to handle the row identifier
parsing separately, and normally when handling schema we don't include this
information; i.e. to make the internal model of rowId similar to what we have
in #2354, except that serialization/deserialization is different.
@rdblue since you involved in both conversations, could you please take a
look? Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]