yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065083
##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
}
});
}
+
+ public static Schema fromJsonWithId(int schemaId, JsonNode json) {
Review comment:
I have removed `fromJsonWithId` and update `fromJson` to use the
constructor with schema Id, if Id present in Json blob, or default to
constructor that doesn't require Id. Please let me know if it is expected!
Regarding persisting schema ID in manifest, since schema and partition-spec
in both v1 and v2 are required in manifest metadata, I think it is unlikely
that we will miss schema itself but has schema ID for lookup? And I think in
manifest reader, schema is only used for constructing partition spec, so that
having a correct schema ID might not be super necessary.
Today there are a lot of places that rely on toJson/fromJson for schema,
e.g. Avro metadata, Hive table properties, scan tasks. In this PR, except for
table metadata parser, those are all using the `toJson` version that do not
write schemaId at all, which means that when the strings are read back to
Schema object they will always have id of 0; and this goes back to the
persisting schema ID question above, that we do not persist id in the manifest
header, so we are not persisting an incorrect ID, but instead constructed a
schema without ID that default to be 0 like everywhere else.
My original thinking was that schema Id should always be ignored/considered
inaccurate unless we directly get it from table metadata, as after all kinds of
projections it might be hard to track if the ID is correct. And the ID is only
used for lookup, so if we have the schema struct, it's fine to have incorrect
ID since we are unlikely to use it. But I might be overcomplicating the
problem. Do you think we are able to guarantee the ID to be right everywhere,
and do you have comment about the current state of `toJson` variations (that
only the one used by metadata parser writes out ID)?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]