Re: [I] Wrong timestamp type read while from parquet file created by spark [arrow-datafusion]

via GitHub Thu, 02 Nov 2023 14:22:43 -0700


comphead commented on issue #7958:
URL: 
https://github.com/apache/arrow-datafusion/issues/7958#issuecomment-1791556116


   `arrow-rs` treats the INT96 Parquet type as `Timestamp(Nanosecond)`
   
https://github.com/apache/arrow-rs/blob/master/parquet/src/arrow/schema/primitive.rs#L97
 
   
   Snowflake has an interesting explanation of the same issue
   
https://community.snowflake.com/s/article/TIMESTAMP-function-returns-wrong-date-time-value-from-Parquet-file
   
   Key takeaways
   - INT96 Parquet field is deprecated 
https://issues.apache.org/jira/browse/PARQUET-323
   - INT96 is only used to represent **nanosecond** timestamps
   - Apache projects like Hive and Spark still incorrectly treat the first 16 
bytes, so they return what users think is the correct value, but it is in fact 
incorrect.
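
For reference, here is a minimal sketch of how an INT96 timestamp is commonly decoded. It assumes the usual little-endian layout (8 bytes of nanoseconds within the day, then a 4-byte Julian day number). The function names are made up for illustration and are not the `arrow-rs` or Spark API.

```python
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 1_000_000_000


def decode_int96(raw: bytes) -> int:
    """Return nanoseconds since the Unix epoch for a 12-byte INT96 value."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    # Assumed layout: int64 nanoseconds-of-day followed by int32 Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day - JULIAN_DAY_OF_UNIX_EPOCH) * NANOS_PER_DAY + nanos_of_day


def encode_int96(epoch_nanos: int) -> bytes:
    """Inverse of decode_int96, handy for building test values."""
    days, nanos_of_day = divmod(epoch_nanos, NANOS_PER_DAY)
    return struct.pack("<qi", nanos_of_day, days + JULIAN_DAY_OF_UNIX_EPOCH)
```

Decoding this way yields a nanosecond value, which matches the `Timestamp(Nanosecond)` mapping. A reader that wants microseconds has to divide the result by 1,000 afterwards.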
   
   That is the reason for the difference. However, DuckDB also behaves like 
Spark. To provide compatibility, we may want to introduce a config param in DF 
that treats INT96 like Spark.
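
A rough sketch of what such a switch could do, assuming the Spark-compatible mode means reading INT96 at microsecond precision (Spark's timestamp type is microsecond-based). `Int96Unit` and `coerce_int96_timestamp` are hypothetical names, not an existing DataFusion option.

```python
from enum import Enum


class Int96Unit(Enum):
    """Hypothetical option values: nanoseconds per target unit."""
    NANOSECOND = 1
    MICROSECOND = 1_000


def coerce_int96_timestamp(epoch_nanos: int, unit: Int96Unit) -> int:
    """Convert a decoded INT96 value (epoch nanoseconds) to the configured unit."""
    # Floor division rounds pre-1970 (negative) values toward earlier instants.
    return epoch_nanos // unit.value
```

With the default set to `NANOSECOND`, current behavior would stay unchanged. Only users who opt in would get the Spark-style result.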
   
   What are your thoughts? @alamb @waitingkuo @tustvold @viirya 
   
   


