kevinjqliu commented on issue #1284: URL: https://github.com/apache/iceberg-python/issues/1284#issuecomment-2452399942
Thanks for the writeup!

> Keep the field_ids when using pyiceberg.schema.Schema, but generate new ones when using pyarrow.Schema

+1 to this.

Here's my take:

Currently, `create_table` can take in either a PyIceberg Schema (`pyiceberg.schema.Schema`) or a PyArrow Schema (`pyarrow.Schema`). Inside `create_table`, a PyArrow Schema is converted to a PyIceberg Schema, and then the field_ids are all reassigned, regardless of which kind of schema was passed in.

https://github.com/apache/iceberg-python/blob/d559e53ed1895f947274c23de754b802a3f6c46f/pyiceberg/catalog/rest.py#L573-L576

There are 3 ways a user can interact with the `create_table` API:

* Passing in a PyIceberg Schema that already has field_ids assigned
* Passing in a PyIceberg Schema that needs field_ids assignment
* Passing in a PyArrow Schema

For the first case, we should not reassign the field_ids, but we do today. For the second case, we should assign field_ids, since they are required. For the third case, we should convert to a PyIceberg Schema and assign field_ids.

I think it'll be more user-friendly for the `create_table` function to accept either:

1. A PyIceberg Schema with field_ids already assigned
2. A PyArrow Schema, in which case we'll do the conversion and field_ids assignment

Today, I think the only way to craft a PyIceberg Schema that needs field_ids assignment is either by creating it by hand or by using the `_convert_schema_if_needed` function. We might need a way to check the validity of the field_ids in case the PyIceberg Schema was created by hand. And we can always couple `_convert_schema_if_needed` with `assign_fresh_schema_ids` to ensure that PyIceberg Schemas always have a valid field_id assignment.

I'm excited about this change. It might also help us streamline `create_table` with partition_spec and sort_order, since there's been an issue with field_id mismatches there.
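
To make the three cases concrete, here is a rough, self-contained sketch of the decision logic described above. It does not use PyIceberg or PyArrow: `Field`, `IcebergLikeSchema`, `ArrowLikeSchema`, `assign_fresh_ids`, `has_valid_ids`, and `resolve_schema` are all hypothetical stand-ins invented for illustration, the schemas are flat (no nested types), and "valid" is assumed to mean "every id is a unique positive integer". The real check would likely need more than that.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Field:
    name: str
    field_id: Optional[int] = None  # None stands for "not yet assigned"


@dataclass(frozen=True)
class IcebergLikeSchema:
    """Hypothetical stand-in for a schema that carries field ids."""
    fields: List[Field]


@dataclass(frozen=True)
class ArrowLikeSchema:
    """Hypothetical stand-in for a schema that has names but no field ids."""
    names: List[str]


def assign_fresh_ids(names: List[str]) -> IcebergLikeSchema:
    # Assign ids 1, 2, 3, ... in field order (flat schemas only).
    return IcebergLikeSchema([Field(n, i) for i, n in enumerate(names, start=1)])


def has_valid_ids(schema: IcebergLikeSchema) -> bool:
    # Assumed validity rule: every id is set, positive, and unique.
    ids = [f.field_id for f in schema.fields]
    return all(i is not None and i > 0 for i in ids) and len(set(ids)) == len(ids)


def resolve_schema(schema: Union[IcebergLikeSchema, ArrowLikeSchema]) -> IcebergLikeSchema:
    if isinstance(schema, ArrowLikeSchema):
        # Case 3: convert, then assign fresh ids.
        return assign_fresh_ids(schema.names)
    if has_valid_ids(schema):
        # Case 1: ids were supplied by the user, so keep them untouched.
        return schema
    if all(f.field_id is None for f in schema.fields):
        # Case 2: no ids at all, so assign fresh ones.
        return assign_fresh_ids([f.name for f in schema.fields])
    # Hand-built schema with partial, duplicate, or non-positive ids.
    raise ValueError("schema has an invalid field_id assignment")
```

With this shape, a schema built with ids 10 and 20 comes back with those ids unchanged, while an id-less schema gets ids starting at 1. Rejecting partially assigned or duplicate ids instead of silently reassigning them is one possible design choice, not something the discussion has settled.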
