samredai opened a new pull request #3677: URL: https://github.com/apache/iceberg/pull/3677
This is the table metadata portion of #3227. This is incomplete, but I wanted it to be visible to get feedback on the direction. The equivalent logic in the legacy implementation is stored in [table_metadata.py](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/table_metadata.py) and [table_metadata_parser.py](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/table_metadata_parser.py).

The idea here is that, instead of writing very explicit parsing and validation logic for table metadata files, we can rely on the standard library in conjunction with [jsonschema](https://json-schema.org/) tooling to accomplish both. The `TABLE_METADATA_V2_SCHEMA` jsonschema definition found in metadata.py is an example of how this can be done (it still needs to be tuned to match the spec exactly). The `TableMetadata` class itself naively parses a given JSON object and includes a `validate()` method that validates against the defined jsonschema. (`validate()` simply calls the `validate_v1()` or `validate_v2()` static method.)

Table metadata values can then be retrieved using simple dot notation and can be updated as well.

```py
table_metadata = TableMetadata.from_s3("s3://foo/bar/baz.metadata.json", version=2)
print(table_metadata.properties.read_split_target_size)  # 134217728
table_metadata.properties.read_split_target_size = 268435456
print(table_metadata.properties.read_split_target_size)  # 268435456
```

Once tweaked, the jsonschema definition should prove reusable, since jsonschema parsers exist in almost every language. It may also be valuable to include a blessed jsonschema definition in the Iceberg docs.
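For readers who want to see how dot-notation access over a parsed metadata file could work, here is a minimal sketch using only the standard library. It is not the code in this PR: `parse_metadata` and `_to_namespace` are hypothetical names, the mapping of `.` and `-` in keys to `_` is an assumed convention chosen to match the attribute names in the example above, and no jsonschema validation is performed.

```python
import json
from types import SimpleNamespace


def _to_namespace(obj: dict) -> SimpleNamespace:
    """Convert one decoded JSON object into an attribute-style namespace."""
    # Keys such as "read.split.target-size" are not valid Python identifiers,
    # so "." and "-" are mapped to "_" (an assumed naming convention).
    return SimpleNamespace(
        **{key.replace(".", "_").replace("-", "_"): value for key, value in obj.items()}
    )


def parse_metadata(raw: str) -> SimpleNamespace:
    """Parse a metadata JSON string; nested objects become nested namespaces."""
    # json.loads calls object_hook for every decoded object, innermost first.
    return json.loads(raw, object_hook=_to_namespace)


# A tiny stand-in document, not a complete table metadata file.
raw = json.dumps(
    {
        "format-version": 2,
        "current-snapshot-id": 1,
        "properties": {"read.split.target-size": 134217728},
    }
)

metadata = parse_metadata(raw)
# Values can be read and reassigned with plain attribute access.
metadata.properties.read_split_target_size = 268435456
```

Because the renaming is lossy, writing the object back out would need a reverse mapping to the original key names, which this sketch does not attempt.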
A proposal for editing table metadata is to use a collection of functions that update value(s), validate that the schema is still valid, and return the updated `TableMetadata` instance:

```py
from copy import deepcopy
import time


def rollback(table_metadata: TableMetadata, snapshot_id: int) -> TableMetadata:
    # Work on a copy so the caller's instance is left untouched.
    new_table_metadata = deepcopy(table_metadata)
    now_millis = int(time.time() * 1000)
    # Record the change of current snapshot in the snapshot log.
    new_table_metadata.snapshot_log.append({"timestamp_millis": now_millis, "snapshot_id": snapshot_id})
    new_table_metadata.current_snapshot_id = snapshot_id
    return new_table_metadata
```
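The update, validate, return pattern proposed above can be sketched end to end on plain dictionaries. This is only an illustration under stated assumptions: `rollback_validated` and `check_current_snapshot` are hypothetical names, the validator is a hand-written stand-in rather than the jsonschema check, and the hyphenated key names are my reading of the table spec rather than what the PR uses.

```python
from copy import deepcopy
import time
from typing import Any, Callable, Dict

Metadata = Dict[str, Any]


def check_current_snapshot(metadata: Metadata) -> None:
    """Stand-in validator (hypothetical): the current snapshot must be a known snapshot."""
    known_ids = {snapshot["snapshot-id"] for snapshot in metadata.get("snapshots", [])}
    if metadata.get("current-snapshot-id") not in known_ids:
        raise ValueError("current-snapshot-id does not refer to a known snapshot")


def rollback_validated(
    metadata: Metadata,
    snapshot_id: int,
    validate: Callable[[Metadata], None] = check_current_snapshot,
) -> Metadata:
    """Return an updated copy of `metadata`, or raise if the result is invalid."""
    # Copy first so a failed validation never leaves the input half-updated.
    new_metadata = deepcopy(metadata)
    now_millis = int(time.time() * 1000)
    new_metadata.setdefault("snapshot-log", []).append(
        {"timestamp-ms": now_millis, "snapshot-id": snapshot_id}
    )
    new_metadata["current-snapshot-id"] = snapshot_id
    # Validate the candidate before handing it back to the caller.
    validate(new_metadata)
    return new_metadata
```

Passing the validator in as a parameter keeps each update function small and would let a jsonschema-based check be swapped in without touching the update logic.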
