[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

GitBox Wed, 17 Feb 2021 17:37:15 -0800


yyanyy commented on a change in pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096#discussion_r578065083




##########
File path: core/src/main/java/org/apache/iceberg/SchemaParser.java
##########
@@ -253,4 +266,15 @@ public static Schema fromJson(String json) {
       }
     });
   }
+
+  public static Schema fromJsonWithId(int schemaId, JsonNode json) {

Review comment:
       I have removed `fromJsonWithId` and update `fromJson` to use the 
constructor with schema Id, if Id present in Json blob, or default to 
constructor that doesn't require Id. Please let me know if it is expected!
   
   Regarding persisting schema ID in manifest, since schema and partition-spec 
in both v1 and v2 are required in manifest metadata, I think it is unlikely 
that we will miss schema itself but has schema ID for lookup? And I think in 
manifest reader, schema is only used for constructing partition spec, so that 
having a correct schema ID might not be super necessary. 
   
   Today there are a lot of places that rely on toJson/fromJson for schema, 
e.g. Avro metadata, Hive table properties, scan tasks. In this PR, except for 
table metadata parser, those are all using the `toJson` version that do not 
write schemaId at all, which means that when the strings are read back to 
Schema object they will always have id of 0; and this goes back to the 
persisting schema ID question above, that we do not persist id in the manifest 
header, so we are not persisting an incorrect ID, but instead constructed a 
schema without ID that default to be 0 like everywhere else. 
   
   My original thinking was that schema Id should always be ignored/considered 
inaccurate unless we directly get it from table metadata, as after all kinds of 
projections it might be hard to track if the ID is correct. And the ID is only 
used for lookup, so if we have the schema struct, it's fine to have incorrect 
ID since we are unlikely to use it. But I might be overcomplicating the 
problem. Do you think we are able to guarantee the ID to be right everywhere, 
and do you have comment about the current state of `toJson` variations (that 
only the one used by metadata parser writes out ID)? 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [iceberg] yyanyy commented on a change in pull request #2096: Core: add schema id and schemas to table metadata

Reply via email to