[GitHub] [iceberg] yyanyy opened a new pull request #2096: Core: add schema id and schemas to table metadata

GitBox Fri, 15 Jan 2021 18:01:18 -0800


yyanyy opened a new pull request #2096:
URL: https://github.com/apache/iceberg/pull/2096

This PR adds `current-schema-id` and `schemas` to table metadata. It also
introduces a wrapper around schema that associates a table schema with its ID.
The ID is not added directly to `Schema` because schema creation is currently
used widely as a convenience in many operations that don't involve a "real"
table schema.
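
The wrapper idea can be illustrated with a minimal, language-neutral sketch. This is Python rather than the Java in the PR, and the names `Schema`, `VersionedSchema`, `schema_id` and `fields` are hypothetical stand-ins, not the classes the PR introduces:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Schema:
    """Stand-in for a plain schema: field names only, no ID (assumption)."""
    fields: Tuple[str, ...]


@dataclass(frozen=True)
class VersionedSchema:
    """Hypothetical wrapper that pairs a table schema with its ID."""
    schema_id: int
    schema: Schema


# Ad-hoc schemas (for example, a projection) never need an ID.
projection = Schema(fields=("id",))

# Only schemas tracked in table metadata are wrapped with an ID.
table_schema = VersionedSchema(schema_id=0, schema=Schema(fields=("id", "data")))
```

Keeping the ID outside `Schema` means callers that build throwaway schemas are unaffected, while table metadata can still track each schema by ID.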

Next steps:
- add `schema-id` to snapshot log/history entries and populate it
- use history entries and `schemas` to look up the right schema in time
travel queries; this may mean adding `schemas()` to the `Table` API
- update the spec
- add schema ID to `historyTable` (discussed under open question 1)
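
As a rough sketch of the time travel lookup in the second step, assuming each history entry carries a `schema-id` (which this PR does not add yet) and that `schemas` is a list of objects keyed by `schema-id`; the helper name `schema_as_of` and the simplified `fields` layout are hypothetical:

```python
from typing import Optional


def schema_as_of(metadata: dict, timestamp_ms: int) -> Optional[dict]:
    """Return the schema that was current at timestamp_ms, or None."""
    # Keep only history entries at or before the requested time.
    entries = [e for e in metadata["snapshot-log"]
               if e["timestamp-ms"] <= timestamp_ms]
    if not entries:
        return None
    # The most recent of those entries decides which schema applies.
    latest = max(entries, key=lambda e: e["timestamp-ms"])
    by_id = {s["schema-id"]: s for s in metadata["schemas"]}
    return by_id.get(latest["schema-id"])


# Simplified metadata; the "schema-id" key in log entries is an assumption.
metadata = {
    "current-schema-id": 1,
    "schemas": [
        {"schema-id": 0, "fields": ["id"]},
        {"schema-id": 1, "fields": ["id", "data"]},
    ],
    "snapshot-log": [
        {"timestamp-ms": 1000, "snapshot-id": 11, "schema-id": 0},
        {"timestamp-ms": 2000, "snapshot-id": 12, "schema-id": 1},
    ],
}

old_schema = schema_as_of(metadata, 1500)  # resolves to schema 0
```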

Open questions:
1. The current approach writes the newly introduced fields to JSON by default,
even in v1, which raises a forward/backward compatibility concern: if a new
writer writes a schema (with ID 0) and then updates it (with ID 1), the
metadata will store both schema 0 and schema 1, and the current schema ID will
be 1. If an old writer then reads and rewrites the metadata for any change, it
drops schema 0 from the metadata. When a new writer picks up the metadata
again, the original schema 0 is gone, and what was schema 1 now has ID 0. This
could make schema IDs inconsistent across writers.
- Since the ID is introduced in this PR, no metadata table exposes these
inconsistent schema IDs yet, so we may not have this problem for now.
However, once we start adding schema ID to the `historyTable` metadata table,
ID 0 could mean different things at different times in that table. We could
potentially work around this by exposing the `schemaID` field in `historyTable`
only for v2 tables, or by mentioning this caveat in the spec.
- Alternatively, we can expose these two fields only in v2 tables and have time
travel queries in v1 always rely on reading old table metadata files, as
implemented in #1508. This could mean that future changes that depend
on schema ID cannot be introduced in v1.
2. Do we want to add a `last-assigned-schema-id` to table metadata? My
answer would be yes, for a reason similar to the one mentioned in [this
comment](https://github.com/apache/iceberg/pull/2089#issuecomment-761184851).
3. I think that currently, when a table is replaced, earlier history
entries/`snapshotLog` are reset to empty (second-to-last argument
[here](https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/TableMetadata.java#L712)).
Is this expected? Do we want to fix this as a separate issue?
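
The compatibility scenario in open question 1 can be reproduced with a small simulation. This is only a sketch under stated assumptions: metadata is modeled as a plain dict, the old writer is assumed to keep only the single `schema` field on rewrite, and the function names and the simplified `fields` layout are hypothetical rather than Iceberg code:

```python
import copy


def load_as_new_writer(metadata: dict) -> dict:
    """New writer reading metadata: a lone schema is assigned ID 0."""
    md = copy.deepcopy(metadata)
    if "schemas" not in md:
        md["schemas"] = [{"schema-id": 0, "fields": md["schema"]}]
        md["current-schema-id"] = 0
    return md


def update_schema(metadata: dict, fields: list) -> dict:
    """New writer updating the schema: append it under the next ID."""
    md = copy.deepcopy(metadata)
    next_id = max(s["schema-id"] for s in md["schemas"]) + 1
    md["schemas"].append({"schema-id": next_id, "fields": fields})
    md["current-schema-id"] = next_id
    md["schema"] = fields
    return md


def old_writer_rewrite(metadata: dict) -> dict:
    """Old writer: knows only "schema" and drops the new fields (assumption)."""
    return {"schema": copy.deepcopy(metadata["schema"])}


v0 = load_as_new_writer({"schema": ["id"]})  # schema ["id"] gets ID 0
v1 = update_schema(v0, ["id", "data"])       # IDs 0 and 1 stored, current is 1
v2 = old_writer_rewrite(v1)                  # "schemas" and the current ID are lost
v3 = load_as_new_writer(v2)                  # ["id", "data"] is now ID 0
```

In `v1`, ID 0 refers to `["id"]`; in `v3`, ID 0 refers to `["id", "data"]`. Any consumer that recorded schema ID 0 before the old writer's rewrite would resolve it to a different schema afterwards.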

