yyanyy commented on pull request #2465: URL: https://github.com/apache/iceberg/pull/2465#issuecomment-828901103
LGTM overall, just want to bring the below comment (that's not really related to the focus of this PR itself) I had originally [here](https://github.com/apache/iceberg/pull/2465#discussion_r616222010) to people's attention and make sure we are all aligned, since this PR does change the behavior of what we preserve in metadata fields. If we do go with the current approach, I'll raise a separate PR to mention the usage of schema ID and the caveat in spec. Also @pvary since hive module uses schema `toJson` indicated in the comment below. > Basically I think there are a lot of places where this method is called for serialization/deserialization of the schema as well as preserve the schema to some file metadata, and at that point sometimes schema object is already transformed/projected so it is no longer an existing schema of the table, and thus shouldn't have schema ID. Since schema ID is in theory only useful when retrieved directly from table metadata to get the right schema version for time travel, I think it's better to not explicitly write a schema Id except for writing to table metadata file, to avoid persisting an incorrect version. > ... >Thanks for the response @rdblue ! Just want to mention an additional case: in addition to the possibility of writing incorrect ID for a table schema, writing id here also makes us persisting id for things not related to the table schema itself, e.g. [manifest list](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/ManifestListWriter.java#L94-L98) and [delete files](https://github.com/apache/iceberg/blob/master/parquet/src/main/java/org/apache/iceberg/parquet/Parquet.java#L387-L414). But if we are fine with writing incorrect ID, we probably will be fine with writing ID for non-schema structs as well... > > I guess essentially we are choosing between 1) simpler code with potential confusing IDs and 2) code complexities (due to avoid writing incorrect/unuseful IDs); I personally would prefer the latter but I don't strongly oppose the former if we think it's easier to reason about in code. Also we might want to involve people from Hive as part of this discussion since they are using this method as well to preserve the schema ([example](https://github.com/apache/iceberg/blob/master/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java#L119)) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
