yyanyy commented on a change in pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#discussion_r612661016
##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -158,7 +169,7 @@ public static String toJson(Schema schema, boolean pretty) {
if (pretty) {
generator.useDefaultPrettyPrinter();
}
- toJson(schema.asStruct(), generator);
+ toJson(schema.asStruct(), schema.schemaId(), schema.rowIdentifiers(),
generator);
Review comment:
Haven't reviewed the entire PR yet, just for this specific comment - the
reason was mentioned in
https://github.com/apache/iceberg/pull/2096#discussion_r575716542. Basically I
think there are a lot of places where this method is called for
serialization/deserialization of the schema as well as preserve the schema to
some file metadata, and at that point sometimes `schema` object is already
transformed/projected so it is no longer an existing schema of the table, and
thus shouldn't have schema ID. Since schema ID is in theory only useful when
retrieved directly from table metadata to get the right schema version for time
travel, I think it's better to not explicitly write a schema Id except for
writing to table metadata file, to avoid persisting an incorrect version.
For the case of supporting row identifier to schema, I wonder if we want to
do something similar - since row identifier is mostly useful during writing (to
check table to get the default row identifier for delete) and not reading
(since data/delete file should have this preserved in another place) or any
kind of projection (especially that with the current change during projection
we will lose row identifier information), to ensure that we don't involve in
the complications when working with a mutated schema object I wonder if we want
to have a separate `toJson`/`fromJson` to handle the row identifier parsing
just for table metadata, and once we translate table metadata file to its
model, we read row id from table instead of schema with something like
`table.getRowIdentifier` which may help with table
serialization/deserialization. But this approach seems to contradict with some
aspects in https://github.com/apache/iceberg/pull/2354#issuecomment-816337252.
@rdblue since you involved in both conversations, could you please take a
look? Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]