comphead commented on issue #7958: URL: https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1791556116
`arrow-rs` treats the INT96 Parquet type as `Timestamp(NanoSecond)`: https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97

Snowflake has an interesting explanation of the same issue: https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file

Key takeaways:
- The INT96 Parquet type is deprecated: https://issues.apache.org/jira/browse/PARQUET-323
- INT96 is only used to represent **nanosecond** timestamps.
- Apache projects like Hive and Spark still incorrectly treat the first 16 bytes, so they return what users think is the correct value, but it is in fact incorrect.

That is the reason for the difference. However, DuckDB also behaves like Spark.

To provide compatibility, we may want to introduce a config parameter in DF that treats INT96 the way Spark does.

What are your thoughts? @alamb @waitingkuo @tustvold @viirya
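For reference, here is a minimal sketch of how an INT96 value is conventionally decoded into a nanosecond timestamp. It assumes the usual layout written by Hive, Impala and Spark: 12 bytes, little-endian, with the first 8 bytes holding nanoseconds within the day and the last 4 bytes holding the Julian day number. The function name `int96_to_unix_nanos` is hypothetical and is not part of `arrow-rs` or DataFusion.

```python
import struct

# Julian day number of the Unix epoch, 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
NANOS_PER_DAY = 86_400 * 1_000_000_000


def int96_to_unix_nanos(raw: bytes) -> int:
    """Decode a 12-byte INT96 value into nanoseconds since the Unix epoch.

    Assumed layout (little-endian): 8 bytes of nanoseconds within the day,
    followed by 4 bytes of Julian day number.
    """
    if len(raw) != 12:
        raise ValueError(f"INT96 values are 12 bytes, got {len(raw)}")
    # "<" disables padding, so "q" (8 bytes) + "i" (4 bytes) is exactly 12 bytes.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


# Midnight on the epoch day decodes to 0.
epoch = struct.pack("<qi", 0, JULIAN_DAY_OF_UNIX_EPOCH)
print(int96_to_unix_nanos(epoch))  # prints 0
```

Python integers do not overflow, so this sketch returns a value for any Julian day. A reader that stores the result in a signed 64-bit nanosecond field has to decide separately what to do with values outside that range.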
