kevinjqliu commented on issue #2663: URL: https://github.com/apache/iceberg-python/issues/2663#issuecomment-3508497698
Thanks for confirming, @chrisqiqiu. Glad this is resolved.

I see [the fix from Dremio here](https://github.com/dremio/dremio-oss/commit/799ccbda47e6f2e1bfacf1ccbded174e00d4150a#diff-1a753e8465f90cd13c7b55bf1cfea8fa3fe3b24a15595cef20f2ea7d11e3f6edR107-R115). However, I believe there may be an issue with the fix: the `TIMESTAMPMILLI` data type should never set `isAdjustedToUTC` to true. I’ll raise a separate issue with Dremio regarding this.

As for the underlying problem, I uncovered a few interesting behaviors and learnings:

1. A schema mismatch between the table and the data file is not compliant with the Iceberg spec and can lead to undefined behavior. Specifically:
   * Iceberg `timestamp` fields expect the Parquet schema to be timestamp with `isAdjustedToUTC=false`.
   * Iceberg `timestamptz` fields expect the Parquet schema to be timestamp with `isAdjustedToUTC=true`.
2. Spark behavior: Spark allows reading mismatched `timestamp`/`timestamptz` from Parquet. This is mentioned above, and I also verified it locally.
3. PyIceberg behavior: PyIceberg allows reading a Parquet timestamp as Iceberg `timestamptz` (#2333), but not a Parquet timestamptz as Iceberg `timestamp`. This is also noted above.

> Some vendors will have different approaches to writing data files. As long as schema types are preserved, it should be valid.

Agreed, but I think this particular issue falls into a gray area. Ideally, everyone should write data in accordance with the Iceberg spec. This is something the spec could clarify better.

> I think PyIceberg's decision to treat data differently regardless of the schema compromises this principle.

Agreed. PyIceberg currently behaves differently from Spark. Since a mismatched schema is not aligned with the spec, the behavior is undefined. However, I believe we can work toward aligning PyIceberg with Spark here.
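To make the three behaviors above easier to compare, here is a minimal sketch that encodes them as a lookup in plain Python. It is only a model of what this comment describes, not code from PyIceberg or Spark: the `Reader` enum and the `can_read` helper are hypothetical names, and treating a spec-strict reader as rejecting a mismatch is an assumption, since the spec leaves that case undefined.

```python
from enum import Enum


class Reader(Enum):
    SPEC = "spec"            # hypothetical strict reader (assumption: rejects mismatches)
    PYICEBERG = "pyiceberg"  # behavior as described in this comment (see #2333)
    SPARK = "spark"          # behavior as described in this comment


# Iceberg type -> isAdjustedToUTC value expected in the Parquet schema.
EXPECTED_ADJUSTED_TO_UTC = {"timestamp": False, "timestamptz": True}


def can_read(reader: Reader, iceberg_type: str, is_adjusted_to_utc: bool) -> bool:
    """Return True if `reader` accepts a Parquet timestamp column with the
    given isAdjustedToUTC flag for a table field of `iceberg_type`."""
    expected = EXPECTED_ADJUSTED_TO_UTC[iceberg_type]
    if is_adjusted_to_utc == expected:
        return True  # file matches the table schema, so every reader accepts it
    if reader is Reader.SPARK:
        return True  # Spark reads mismatches in both directions
    if reader is Reader.PYICEBERG:
        # Only a non-UTC-adjusted file column read as timestamptz is accepted.
        return iceberg_type == "timestamptz" and not is_adjusted_to_utc
    return False  # assumption: a strict reader refuses the undefined case


# Print the full matrix of reader / table type / file flag combinations.
for reader in Reader:
    for iceberg_type in EXPECTED_ADJUSTED_TO_UTC:
        for flag in (False, True):
            print(reader.value, iceberg_type, flag, can_read(reader, iceberg_type, flag))
```

The only asymmetric row is PyIceberg's: a file written with `isAdjustedToUTC=false` can be read as `timestamptz`, while a file written with `isAdjustedToUTC=true` cannot be read as `timestamp`. Aligning with Spark would amount to making that branch return `True` in both directions.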
