abnobdoss opened a new pull request, #3384:
URL: https://github.com/apache/iceberg-python/pull/3384

   # Rationale for this change
   
   This is the first in a planned series of PRs to improve the stability and 
speed of the `Table.upsert` operation. This PR focuses on improving the 
correctness foundations by implementing "Fail Fast" validation for join key 
types.
   
   By rejecting unsupported types upfront, we prevent two major classes of 
issues:
   
   1. Silent Data Loss: Floating-point join keys are rejected because PyArrow 
joins treat -0.0 and 0.0 as distinct values while Iceberg filters treat them as 
equal, leading to missed updates.
   2. Engine Crashes: Nested types (structs, lists, maps), dictionary-encoded 
columns, and extension types (e.g., UUID) are rejected early to avoid cryptic 
C++ crashes in the underlying PyArrow join kernels.
   
   This establishes a safe contract for the subsequent performance-focused PRs 
(Vectorization and Anti-Join de-duplication).
   
   ## Are these changes tested?
   
   Yes. I have added a comprehensive suite of parameterized tests in 
`tests/table/test_upsert.py`.  
    * Validation Matrix: Verified that  ValueError  or  NotImplementedError  
are correctly raised for Floating Point, Nested, Dictionary, Null, and 
Extension types.
    * Correctness: Confirmed that standard primitive types (String, Int, Long, 
Decimal, etc.) continue to function as expected.
    * Schema Authority: Added tests ensuring that validation happens against 
both the Table Schema (architectural integrity) and the Dataframe Schema 
(memory format implementation).
   
   ## Are there any user-facing changes?
   Yes. The `Table.upsert` method now includes strict type validation for the 
join columns a user provides.
   
    * Users attempting to upsert on floating-point or nested columns will now 
receive a descriptive error message explaining the risk and suggesting a cast 
to Decimal or Integer.
    * This is a protective change that prevents users from accidentally writing 
corrupt data or encountering low-level engine crashes.
   
   For full disclosure - this PR was developed with the assistance of an AI 
coding assistant (Antigravity) to help refine the type-safety checks and 
edge-case validation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to