abnobdoss opened a new pull request, #3384:
URL: https://github.com/apache/iceberg-python/pull/3384
# Rationale for this change
This is the first in a planned series of PRs to improve the stability and
speed of the `Table.upsert` operation. This PR focuses on improving the
correctness foundations by implementing "Fail Fast" validation for join key
types.
By rejecting unsupported types upfront, we prevent two major classes of
issues:
1. Silent Data Loss: Floating-point join keys are rejected because PyArrow
joins treat `-0.0` and `0.0` as distinct values while Iceberg filters treat
them as equal, leading to missed updates.
2. Engine Crashes: Nested types (structs, lists, maps), dictionary-encoded
columns, and extension types (e.g., UUID) are rejected early to avoid cryptic
C++ crashes in the underlying PyArrow join kernels.
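To illustrate why signed zeros are hazardous as join keys, here is a minimal standard-library sketch. It only shows that the two zeros compare equal numerically while their IEEE 754 encodings differ; it does not reproduce PyArrow's join internals, and the helper name `float_bits` is hypothetical.

```python
import struct


def float_bits(value: float) -> bytes:
    """Return the raw big-endian IEEE 754 encoding of a 64-bit float."""
    return struct.pack(">d", value)


neg_zero, pos_zero = -0.0, 0.0

# Numeric comparison (the semantics a filter predicate would use):
# the two zeros are considered equal.
numerically_equal = neg_zero == pos_zero

# Byte-level comparison: only the sign bit differs, so any key matching
# that works on the raw encoding would see two different keys.
bitwise_equal = float_bits(neg_zero) == float_bits(pos_zero)

print(numerically_equal, bitwise_equal)
```

When one side of an operation matches numerically and the other distinguishes the two encodings, a row can be selected by the filter yet fail to pair up in the join, which is the kind of missed update described above.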
This establishes a safe contract for the subsequent performance-focused PRs
(Vectorization and Anti-Join de-duplication).
## Are these changes tested?
Yes. I have added a comprehensive suite of parameterized tests in
`tests/table/test_upsert.py`.
* Validation Matrix: Verified that `ValueError` or `NotImplementedError`
is raised for Floating Point, Nested, Dictionary, Null, and
Extension types.
* Correctness: Confirmed that standard primitive types (String, Int, Long,
Decimal, etc.) continue to function as expected.
* Schema Authority: Added tests ensuring that validation runs against
both the table schema (architectural integrity) and the DataFrame schema
(the in-memory format actually passed to the join).
## Are there any user-facing changes?
Yes. The `Table.upsert` method now includes strict type validation for the
join columns a user provides.
* Users attempting to upsert on floating-point or nested columns will now
receive a descriptive error message explaining the risk and suggesting a cast
to Decimal or Integer.
* This is a protective change that prevents users from accidentally writing
corrupt data or encountering low-level engine crashes.
For full disclosure: this PR was developed with the assistance of an AI
coding assistant (Antigravity) to help refine the type-safety checks and
edge-case validation.