yyanyy commented on pull request #2465:
URL: https://github.com/apache/iceberg/pull/2465#issuecomment-828901103


   LGTM overall, just want to bring the below comment (that's not really 
related to the focus of this PR itself) I had originally 
[here](https://github.com/apache/iceberg/pull/2465#discussion_r616222010) to 
people's attention and make sure we are all aligned, since this PR does change 
the behavior of what we preserve in metadata fields. If we do go with the 
current approach, I'll raise a separate PR to mention the usage of schema ID 
and the caveat in spec. 
   
   Also @pvary since hive module uses schema `toJson` indicated in the comment 
below. 
   
   > Basically I think there are a lot of places where this method is called 
for serialization/deserialization of the schema as well as preserve the schema 
to some file metadata, and at that point sometimes schema object is already 
transformed/projected so it is no longer an existing schema of the table, and 
thus shouldn't have schema ID. Since schema ID is in theory only useful when 
retrieved directly from table metadata to get the right schema version for time 
travel, I think it's better to not explicitly write a schema Id except for 
writing to table metadata file, to avoid persisting an incorrect version.
   > ... 
   >Thanks for the response @rdblue ! Just want to mention an additional case: 
in addition to the possibility of writing incorrect ID for a table schema, 
writing id here also makes us persisting id for things not related to the table 
schema itself, e.g. [manifest 
list](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L94-L98)
 and [delete 
files](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L387-L414).
 But if we are fine with writing incorrect ID, we probably will be fine with 
writing ID for non-schema structs as well...
   > 
   > I guess essentially we are choosing between 1) simpler code with potential 
confusing IDs and 2) code complexities (due to avoid writing incorrect/unuseful 
IDs); I personally would prefer the latter but I don't strongly oppose the 
former if we think it's easier to reason about in code. Also we might want to 
involve people from Hive as part of this discussion since they are using this 
method as well to preserve the schema 
([example](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L119))
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to