rdblue commented on code in PR #6323:
URL: https://github.com/apache/iceberg/pull/6323#discussion_r1217020366
##########
python/pyiceberg/table/__init__.py:
##########
@@ -69,21 +72,313 @@
import ray
from duckdb import DuckDBPyConnection
+ from pyiceberg.catalog import Catalog
ALWAYS_TRUE = AlwaysTrue()
+class TableUpdates:
+ _table: Table
+ _updates: Tuple[TableUpdate, ...]
+ _requirements: Tuple[TableRequirement, ...]
+
+ def __init__(
+ self,
+ table: Table,
+ actions: Optional[Tuple[TableUpdate, ...]] = None,
+ requirements: Optional[Tuple[TableRequirement, ...]] = None,
+ ):
+ self._table = table
+ self._updates = actions or ()
+ self._requirements = requirements or ()
+
+ def _append_updates(self, *new_updates: TableUpdate) -> TableUpdates:
+ """Appends updates to the set of staged updates
+
+ Args:
+ *new_updates: Any new updates
+
+ Raises:
+ ValueError: When the type of update is not unique.
+
+ Returns:
+ A new AlterTable object with the new updates appended
+ """
+ for new_update in new_updates:
+ type_new_update = type(new_update)
+ if any(type(update) == type_new_update for update in self._updates):
+ raise ValueError(f"Updates in a single commit need to be unique, duplicate: {type_new_update}")
Review Comment:
It looks like this class is attempting to behave like a transaction because
it stacks up a set of changes and commits them all at once. That seems
reasonable, but then we get strange cases like this one, where there are odd
restrictions. This check would definitely be hit, because the changes for a
real transaction would commonly include more than one `AddSnapshot` update but
just one `SetRefSnapshotId` update.
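
To make the restriction concrete, here is a minimal sketch of the check in isolation. The two dataclasses are hypothetical stand-ins for the PR's update types (the real ones carry more fields), and I am assuming the method goes on to return the staged updates plus the new ones, since the quoted hunk stops before that point.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical stand-ins for the update types; only the names come from the PR.
@dataclass(frozen=True)
class AddSnapshot:
    snapshot_id: int


@dataclass(frozen=True)
class SetRefSnapshotId:
    ref_name: str
    snapshot_id: int


def append_updates(staged: Tuple[object, ...], *new_updates: object) -> Tuple[object, ...]:
    """Apply the same type-uniqueness check as the quoted _append_updates."""
    for new_update in new_updates:
        type_new_update = type(new_update)
        if any(type(update) == type_new_update for update in staged):
            raise ValueError(f"Updates in a single commit need to be unique, duplicate: {type_new_update}")
    # Assumption: the real method returns the old updates plus the new ones.
    return staged + new_updates


staged = append_updates((), AddSnapshot(snapshot_id=1))

# A second AddSnapshot is rejected even though a real commit may need several.
try:
    append_updates(staged, AddSnapshot(snapshot_id=2))
    second_snapshot_rejected = False
except ValueError:
    second_snapshot_rejected = True

# A different update type is still accepted.
staged = append_updates(staged, SetRefSnapshotId(ref_name="main", snapshot_id=1))
```

With this check in place, a commit that adds two snapshots and then moves a ref cannot be expressed as one set of staged updates.
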

I think this is also going to hit an issue with complex changes, like
`UpdateSchema`. That change supports multiple calls and then results in a
finished schema that is sent using `AddSchema` and `SetCurrentSchemaId`
updates. For the API, this would either need to include all of the schema
change methods here, which will get ugly really fast, or we need a way to
have an `UpdateSchema` API that returns back to the overall transaction API.

In Java, we took the second approach. There's a common `UpdateSchema` API
that can be performed as a single operation on a table
(`table.updateSchema().addColumn("x", Types.IntegerType.get()).commit()`) or
combined with others in a transaction (`table.newTransaction()` /
`transaction.updateSchema().commit()` / `transaction.commitTransaction()`).
I suspect that we want to do the same thing here and have some kind of
transaction that accumulates changes from other more specific APIs.

It looks like the issue with this PR is that it tries to combine the
transaction object that accumulates changes and calls `catalog.commit_table`
with the public APIs for making changes to a table. I think I would take the
same approach as Java and have a `Transaction` object to represent multiple
changes to a table, but I would hide that from users in most cases.
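
As a rough illustration of that split, here is a minimal sketch in Python. Everything in it is hypothetical: the class and method names, the simplified update dataclasses, the placeholder schema id, and the `commit_fn` callable that stands in for `catalog.commit_table`. It only shows the shape of a specific API (`UpdateSchema`) that collects several calls and then hands its finished updates back to an accumulating transaction.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, List, Tuple


# Hypothetical, simplified update types; only the names come from this discussion.
@dataclass(frozen=True)
class AddSchema:
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class SetCurrentSchemaId:
    schema_id: int


class UpdateSchema:
    """Collects schema changes, then returns the finished updates to the transaction."""

    def __init__(self, transaction: Transaction, columns: Tuple[str, ...]) -> None:
        self._transaction = transaction
        self._columns: List[str] = list(columns)

    def add_column(self, name: str) -> UpdateSchema:
        self._columns.append(name)
        return self

    def commit(self) -> Transaction:
        # Many add_column calls collapse into one finished schema.
        # The schema id is a placeholder chosen for this sketch.
        self._transaction._stage(AddSchema(tuple(self._columns)), SetCurrentSchemaId(schema_id=-1))
        return self._transaction


class Transaction:
    """Accumulates updates from more specific APIs and commits them all at once."""

    def __init__(self, commit_fn: Callable[[Tuple[object, ...]], None], columns: Tuple[str, ...] = ()) -> None:
        self._commit_fn = commit_fn  # stands in for catalog.commit_table
        self._columns = columns
        self._updates: Tuple[object, ...] = ()

    def update_schema(self) -> UpdateSchema:
        return UpdateSchema(self, self._columns)

    def _stage(self, *updates: object) -> None:
        self._updates += updates

    def commit_transaction(self) -> None:
        self._commit_fn(self._updates)


# Usage: record what would be sent to the catalog instead of calling a real one.
committed: List[Tuple[object, ...]] = []
txn = Transaction(commit_fn=committed.append)
txn.update_schema().add_column("x").add_column("y").commit()
txn.commit_transaction()
```

A table-level `update_schema()` could create a single-use transaction and commit it immediately, which is one way to hide the transaction from users in the common case.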