kevinjqliu commented on issue #2663:
URL: 
https://github.com/apache/iceberg-python/issues/2663#issuecomment-3508497698

   Thanks for confirming, @chrisqiqiu. Glad this is resolved. I see [the fix 
from Dremio 
here](https://github.com/dremio/dremio-oss/commit/799ccbda47e6f2e1bfacf1ccbded174e00d4150a#diff-1a753e8465f90cd13c7b55bf1cfea8fa3fe3b24a15595cef20f2ea7d11e3f6edR107-R115).
 However, I believe there may be an issue with the fix: the `TIMESTAMPMILLI` 
data type should never set `isAdjustedToUTC` to true. I’ll raise a separate issue 
with Dremio regarding this.
   
   As for the underlying problem, I uncovered a few interesting behaviors and 
learnings:
   1. A schema mismatch between the table and the data file is not compliant with the 
Iceberg spec and can lead to undefined behavior. Specifically:
   * Iceberg `timestamp` fields expect the Parquet schema to be a timestamp with 
`isAdjustedToUTC=false`.
   * Iceberg `timestamptz` fields expect the Parquet schema to be a timestamp with 
`isAdjustedToUTC=true`.
   
   2. Spark behavior: Spark allows reading mismatched timestamp/timestamptz 
from Parquet. This is mentioned above, and I also verified it locally.
   
   3. PyIceberg behavior: PyIceberg allows reading a Parquet timestamp as an Iceberg 
`timestamptz` (#2333), but not a Parquet timestamptz as an Iceberg `timestamp`. This is also 
noted above.
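
   To make the three points above concrete, here is a minimal sketch in plain 
Python. It is only a model of the behavior described in this thread, not code from 
PyIceberg, Spark, or Dremio; the names `IcebergType`, `is_spec_compliant`, and 
`reader_accepts` are hypothetical and exist purely for illustration.

```python
from enum import Enum


class IcebergType(Enum):
    """The two Iceberg timestamp flavors discussed in this thread."""
    TIMESTAMP = "timestamp"      # no time zone
    TIMESTAMPTZ = "timestamptz"  # adjusted to UTC


def expected_is_adjusted_to_utc(iceberg_type: IcebergType) -> bool:
    """isAdjustedToUTC value expected on the Parquet TIMESTAMP annotation."""
    return iceberg_type is IcebergType.TIMESTAMPTZ


def is_spec_compliant(iceberg_type: IcebergType,
                      file_is_adjusted_to_utc: bool) -> bool:
    """True when the data file's flag matches the table's field type."""
    return expected_is_adjusted_to_utc(iceberg_type) == file_is_adjusted_to_utc


def reader_accepts(reader: str,
                   iceberg_type: IcebergType,
                   file_is_adjusted_to_utc: bool) -> bool:
    """Hypothetical model of the read behavior reported in this thread."""
    if reader not in ("spark", "pyiceberg"):
        raise ValueError(f"unknown reader: {reader}")
    if is_spec_compliant(iceberg_type, file_is_adjusted_to_utc):
        return True
    if reader == "spark":
        # Point 2: Spark reads either mismatch.
        return True
    # Point 3: PyIceberg accepts only a file without UTC adjustment
    # read as an Iceberg timestamptz (#2333), not the reverse.
    return (iceberg_type is IcebergType.TIMESTAMPTZ
            and not file_is_adjusted_to_utc)


if __name__ == "__main__":
    for reader in ("spark", "pyiceberg"):
        for iceberg_type in IcebergType:
            for flag in (False, True):
                print(reader, iceberg_type.value,
                      f"isAdjustedToUTC={flag}",
                      "spec-compliant" if is_spec_compliant(iceberg_type, flag)
                      else "mismatch",
                      "readable" if reader_accepts(reader, iceberg_type, flag)
                      else "rejected")
```

   In this model, the only combination that is rejected is PyIceberg reading a file 
with `isAdjustedToUTC=true` into an Iceberg `timestamp` field, which is the case 
reported in this issue.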
   
   
   > Some vendors will have different approaches to writing data files. As long 
as schema types are preserved, it should be valid.
   
   Agreed, but I think this particular issue falls into a gray area. Ideally, 
everyone should write data in accordance with the Iceberg spec. This is 
something the spec could clarify better.
   
   > I think PyIceberg's decision to treat data differently regardless of the 
schema compromises this principle.
   
   Agreed. PyIceberg currently behaves differently from Spark. Since this is 
not aligned with the spec, the behavior is undefined. However, I believe we can 
work toward aligning PyIceberg with Spark here.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

