samredai opened a new pull request #3677:
URL: https://github.com/apache/iceberg/pull/3677


   This is the table metadata portion of #3227. It is complete, but I wanted it 
to be visible to get feedback on the direction. The equivalent logic in the 
legacy implementation is stored in 
[table_metadata.py](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/table_metadata.py)
 and 
[table_metadata_parser.py](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/table_metadata_parser.py).
   
   The idea here is that, instead of writing very explicit parsing and 
validation logic for table metadata files, we can rely on the standard library 
in conjunction with [jsonschema](https://json-schema.org/) tooling to 
accomplish both. The `TABLE_METADATA_V2_SCHEMA` jsonschema definition found in 
metadata.py is an example of how this can be done (it still needs to be tuned 
to match the spec exactly). The `TableMetadata` class itself naively parses a 
given JSON object and includes a `validate()` method that validates it against 
the defined jsonschema. (`validate()` simply calls the `validate_v1()` or 
`validate_v2()` static method.) Table metadata values can then be retrieved 
using simple dot notation and can be updated as well.
   ```py
   table_metadata = TableMetadata.from_s3("s3://foo/bar/baz.metadata.json", version=2)
   print(table_metadata.properties.read_split_target_size) # 134217728
   table_metadata.properties.read_split_target_size = 268435456
   print(table_metadata.properties.read_split_target_size) # 268435456
   ```
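
To make the dot-notation idea concrete, here is a minimal sketch that uses only the standard library. It is not the `TableMetadata` implementation from this PR: `to_namespace` is a hypothetical helper, the sample JSON is made up, and the rule that maps `-` and `.` in keys to `_` is an assumption about how attribute names could be derived.

```python
import json
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert parsed JSON into attribute-accessible objects."""
    if isinstance(value, dict):
        # Metadata keys contain "-" and ".", which are not valid in Python
        # identifiers, so map both to "_" (an assumed naming rule).
        return SimpleNamespace(
            **{
                key.replace("-", "_").replace(".", "_"): to_namespace(item)
                for key, item in value.items()
            }
        )
    if isinstance(value, list):
        return [to_namespace(item) for item in value]
    return value


# Hypothetical, heavily abbreviated metadata document.
raw = '{"format-version": 2, "properties": {"read.split.target-size": 134217728}}'
metadata = to_namespace(json.loads(raw))

print(metadata.properties.read_split_target_size)
metadata.properties.read_split_target_size = 268435456
print(metadata.properties.read_split_target_size)
```

One trade-off of this kind of mapping is that writing the object back out requires remembering the original key spelling, which a real implementation would need to handle.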
   
   Once tweaked, the jsonschema definition should prove re-usable since 
jsonschema parsers exist in almost every language. It may also be valuable to 
include a blessed jsonschema definition in the Iceberg docs.
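
As a rough illustration of why such a definition is portable, the sketch below writes a small schema fragment as plain data and checks documents against it. The fragment is illustrative only and is not the `TABLE_METADATA_V2_SCHEMA` from this PR. The `check` function is a hypothetical hand-rolled stand-in that understands only the `required`, `type`, and `const` keywords; a real jsonschema validator would replace it.

```python
import json

# Illustrative fragment only: far from the full v2 table metadata spec.
TABLE_METADATA_V2_FRAGMENT = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["format-version", "table-uuid", "location"],
    "properties": {
        "format-version": {"const": 2},
        "table-uuid": {"type": "string"},
        "location": {"type": "string"},
    },
}

_JSON_TYPES = {"string": str, "integer": int, "object": dict, "array": list}


def check(instance: dict, schema: dict) -> list:
    """Return error strings; supports only required, type, and const."""
    errors = []
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in instance:
            continue
        value = instance[key]
        if "const" in rule and value != rule["const"]:
            errors.append(f"{key}: expected {rule['const']!r}, got {value!r}")
        expected = _JSON_TYPES.get(rule.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and (
            not isinstance(value, expected) or isinstance(value, bool)
        ):
            errors.append(f"{key}: expected type {rule['type']}")
    return errors


# The schema is plain JSON, so it can be shipped to tooling in any language.
schema_json = json.dumps(TABLE_METADATA_V2_FRAGMENT, indent=2)

good = {"format-version": 2, "table-uuid": "abc", "location": "s3://foo/bar"}
bad = {"format-version": 1, "location": 5}
print(check(good, TABLE_METADATA_V2_FRAGMENT))
print(check(bad, TABLE_METADATA_V2_FRAGMENT))
```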
   
   A proposal for editing table metadata is to use a collection of functions 
that update value(s), validate that the metadata still conforms to the schema, 
and return the updated `TableMetadata` instance:
   ```py
   from copy import deepcopy
   import time
   
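   def rollback(table_metadata: TableMetadata, snapshot_id: int) -> TableMetadata:
     # Work on a copy so the caller's instance is left untouched.
     new_table_metadata = deepcopy(table_metadata)
     now_millis = int(time.time() * 1000)
     # Record the rollback in the snapshot log, then repoint the current snapshot.
     new_table_metadata.snapshot_log.append(
       {"timestamp_millis": now_millis, "snapshot_id": snapshot_id}
     )
     new_table_metadata.current_snapshot_id = snapshot_id
     # Confirm the updated metadata is still valid before returning it.
     new_table_metadata.validate()
     return new_table_metadata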
   ```
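
For anyone who wants to try the copy, update, validate, return pattern without the rest of the PR, here is a self-contained sketch. `FakeTableMetadata` is a hypothetical stand-in with only the fields the function touches, and its `validate()` is a simple consistency check assumed for illustration, not the jsonschema validation described above.

```python
import time
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeTableMetadata:
    """Hypothetical stand-in for TableMetadata, just enough to run rollback."""

    current_snapshot_id: int
    snapshots: List[Dict[str, int]] = field(default_factory=list)
    snapshot_log: List[Dict[str, int]] = field(default_factory=list)

    def validate(self) -> None:
        # Assumed rule for this sketch: the current snapshot must be known.
        known_ids = {snapshot["snapshot_id"] for snapshot in self.snapshots}
        if self.current_snapshot_id not in known_ids:
            raise ValueError(f"unknown snapshot id: {self.current_snapshot_id}")


def rollback(table_metadata: FakeTableMetadata, snapshot_id: int) -> FakeTableMetadata:
    # Copy first so a failed validation never leaves the input half-updated.
    new_table_metadata = deepcopy(table_metadata)
    now_millis = int(time.time() * 1000)
    new_table_metadata.snapshot_log.append(
        {"timestamp_millis": now_millis, "snapshot_id": snapshot_id}
    )
    new_table_metadata.current_snapshot_id = snapshot_id
    new_table_metadata.validate()
    return new_table_metadata


original = FakeTableMetadata(
    current_snapshot_id=2,
    snapshots=[{"snapshot_id": 1}, {"snapshot_id": 2}],
)
rolled_back = rollback(original, snapshot_id=1)
print(rolled_back.current_snapshot_id, original.current_snapshot_id)
```

Because the function returns a new instance, the original metadata stays usable if validation raises, which keeps each edit function easy to reason about in isolation.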


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


