cloud-fan commented on pull request #34741: URL: https://github.com/apache/spark/pull/34741#issuecomment-983660633
@bersprockets After reading more ORC code, I feel the timestamp implementation is quite messy in ORC. Not only the reader side but also the writer side shifts the timestamp value according to the JVM local timezone: https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L112-L113

It seems like the default behavior of the ORC lib is designed for people who want to deal with `java.sql.Timestamp` directly, not for an engine like Spark that treats ORC only as a storage layer. Spark should have set `useUTCTimestamp` to true (see the first sketch at the end of this comment), but it's too late now, as we need to support existing ORC files written by old Spark versions. Fixing this mistake in the storage layer will probably take years of smooth migration. My proposal is:

Phase 1:
1. Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well; see the second sketch below).
2. Support reading ORC TIMESTAMP_INSTANT as Spark TIMESTAMP_LTZ.
3. When reading ORC TIMESTAMP, check the column property and only read it as LTZ if the column property says it's LTZ.

Phase 2 (after Spark 3.3 becomes the oldest officially supported version; see the third sketch below):
1. Write LTZ as ORC TIMESTAMP_INSTANT.
2. Write NTZ as ORC TIMESTAMP, with `useUTCTimestamp` set to true.

also cc @gengliangwang @MaxGekk
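To make the `useUTCTimestamp` point concrete, here is a minimal sketch of an engine-side writer setup, assuming ORC 1.6+ where `OrcFile.WriterOptions#useUTCTimestamp` is available (the output path and schema are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class UtcTimestampWriterSketch {
  public static void main(String[] args) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<ts:timestamp>");

    // useUTCTimestamp(true) keeps timestamp values in UTC instead of letting
    // TimestampTreeWriter shift them by the JVM local timezone (the default
    // behavior linked above).
    Writer writer = OrcFile.createWriter(
        new Path("/tmp/utc-ts.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .useUTCTimestamp(true));
    writer.close();
  }
}
```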
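For Phase 1, the column property could be carried as an ORC type attribute. A minimal sketch, assuming ORC 1.6+ type attributes (`TypeDescription#setAttribute` / `#getAttributeValue`); the attribute key and values here are hypothetical, not a decided naming scheme:

```java
import org.apache.orc.TypeDescription;

public class TimestampColumnPropertySketch {
  // Hypothetical attribute key; the real key would be decided in the
  // actual implementation.
  private static final String SPARK_TYPE_KEY = "spark.timestamp.type";

  public static void main(String[] args) {
    // Writer side: NTZ is stored as int64, LTZ as ORC TIMESTAMP, and both
    // carry a column attribute recording the original Spark type.
    TypeDescription schema = TypeDescription.createStruct()
        .addField("ntz_col",
            TypeDescription.createLong()
                .setAttribute(SPARK_TYPE_KEY, "timestamp_ntz"))
        .addField("ltz_col",
            TypeDescription.createTimestamp()
                .setAttribute(SPARK_TYPE_KEY, "timestamp_ltz"));

    // Reader side: only treat an ORC TIMESTAMP column as LTZ when the
    // attribute explicitly says so.
    TypeDescription ltzCol = schema.getChildren().get(1);
    boolean isLtz =
        "timestamp_ltz".equals(ltzCol.getAttributeValue(SPARK_TYPE_KEY));
    System.out.println("ltz_col is LTZ: " + isLtz);
  }
}
```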
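And for Phase 2, a sketch of the target mapping, assuming the ORC schema-string spelling of TIMESTAMP_INSTANT is `timestamp with local time zone` (as in ORC 1.6+):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class Phase2MappingSketch {
  public static void main(String[] args) throws Exception {
    // LTZ -> ORC TIMESTAMP_INSTANT; NTZ -> plain ORC TIMESTAMP, written with
    // useUTCTimestamp so no JVM-local-timezone shifting happens on write.
    TypeDescription schema = TypeDescription.fromString(
        "struct<ltz_col:timestamp with local time zone,ntz_col:timestamp>");

    Writer writer = OrcFile.createWriter(
        new Path("/tmp/phase2.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .useUTCTimestamp(true));
    writer.close();
  }
}
```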
