cloud-fan edited a comment on pull request #34741: URL: https://github.com/apache/spark/pull/34741#issuecomment-983660633
@bersprockets After reading more ORC code, I feel the timestamp implementation is quite messy in ORC. Not only the reader side, but also the writer side shifts the timestamp value according to the JVM local timezone: https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L112-L113 It seems like the ORC lib (the default behavior) is designed for people who want to deal with `java.sql.Timestamp` directly, not an engine like Spark that only treats ORC as a storage layer. Spark should set `useUTCTimestamp` as true but now it's too late as we need to support existing ORC files written by old Spark versions. To fix the mistake in the storage layer, we need probably years to do a smooth migration. My proposal is: Phase1: 1. Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well) 2. Support reading ORC TIMESTAMP_INSTANT as Spark TIMESTAMP_LTZ. 3. When reading ORC TIMESTAMP, check the column property to get the actual type (LTZ or NTZ). Phase 2 (After Spark 3.3 becomes the least officially supported version): 1. Write LTZ as ORC TIMESTAMP_INSTANT 2. Write NTZ as ORC TIMESTAMP, with `useUTCTimestamp` set to true. 3. Set `useUTCTimestamp` to true in the reader if the ORC file was written by the latest Spark version. With this proposal, we can achieve: 1. ORC files written by Spark 3.3 can be correctly read back by Spark 3.3 2. Old spark versions and other systems will read TIMESTAMP_NTZ as long (not a big issue) 3. Old ORC files can still be correctly read by Spark 3.3 4. When phase 2 is completed, all the supported Spark versions and other systems can read the ORC files written by Spark correctly. 5. When phase 2 is completed, all the supported Spark version can still read old ORC files correctly (we can look at the spark version in the file footer to decide if we should set `useUTCTimestamp` to true or not for the reader) also cc @gengliangwang @MaxGekk -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
