cloud-fan commented on pull request #34741:
URL: https://github.com/apache/spark/pull/34741#issuecomment-983660633


   @bersprockets After reading more ORC code, I feel the timestamp implementation in ORC is quite messy. Not only the reader side but also the writer side shifts the timestamp value according to the JVM's local timezone: https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L112-L113
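   To illustrate why that shift is a problem (a minimal, self-contained sketch, not ORC-specific): the epoch value behind a `java.sql.Timestamp` depends on the JVM default timezone, so any adjustment the writer derives from it is tied to the machine that wrote the file:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampShiftDemo {
  public static void main(String[] args) {
    // The same wall-clock string maps to different epoch millis depending
    // on the JVM default timezone, which is exactly the dependency the
    // ORC writer bakes into the file when it shifts by the local timezone.
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
    System.out.println(Timestamp.valueOf("2021-12-01 00:00:00").getTime());
    // 1638345600000

    TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai"));
    System.out.println(Timestamp.valueOf("2021-12-01 00:00:00").getTime());
    // 1638288000000 (16 hours apart for the "same" timestamp)
  }
}
```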
   
   It seems the ORC library's default behavior is designed for people who deal with `java.sql.Timestamp` directly, not for an engine like Spark that treats ORC only as a storage layer. Spark should have set `useUTCTimestamp` to true from the start, but it's too late now, as we need to keep supporting existing ORC files written by old Spark versions.
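   For reference, this is roughly what opting out looks like with the ORC Java API (a sketch, assuming the `useUTCTimestamp` writer option in recent ORC releases; the output path is just a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class UtcWriterSketch {
  public static void main(String[] args) throws java.io.IOException {
    TypeDescription schema = TypeDescription.fromString("struct<ts:timestamp>");
    // useUTCTimestamp(true) stores UTC-based values directly instead of
    // shifting them by the JVM local timezone at write time.
    Writer writer = OrcFile.createWriter(
        new Path("/tmp/ts.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .useUTCTimestamp(true));
    writer.close();
  }
}
```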
   
   Fixing this mistake in the storage layer will probably take years of smooth migration. My proposal is:
   Phase 1:
   1. Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well).
   2. Support reading ORC TIMESTAMP_INSTANT as Spark TIMESTAMP_LTZ.
   3. When reading ORC TIMESTAMP, check the column property and only read it as LTZ if the property says it's LTZ. (Steps 1 and 3 are sketched right after this list.)
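   A sketch of steps 1 and 3 using ORC type attributes (the attribute key and values here are hypothetical placeholders, not settled names):

```java
import org.apache.orc.TypeDescription;

public class NtzColumnSketch {
  // Hypothetical attribute key/values; the real names would be settled in the PR.
  static final String TYPE_ATTR = "spark.sql.catalyst.type";

  // Step 1: declare the NTZ column as plain int64 and tag it so new
  // readers know it is really TIMESTAMP_NTZ.
  static TypeDescription ntzColumn() {
    TypeDescription col = TypeDescription.createLong();
    col.setAttribute(TYPE_ATTR, "timestamp_ntz");
    return col;
  }

  // Step 3: treat an ORC TIMESTAMP column as LTZ only when the attribute
  // explicitly says so.
  static boolean readAsLtz(TypeDescription col) {
    return col.getCategory() == TypeDescription.Category.TIMESTAMP
        && "timestamp_ltz".equals(col.getAttributeValue(TYPE_ATTR));
  }
}
```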
   
   Phase 2 (after Spark 3.3 becomes the oldest officially supported version):
   1. Write LTZ as ORC TIMESTAMP_INSTANT.
   2. Write NTZ as ORC TIMESTAMP, with `useUTCTimestamp` set to true. (A schema-level sketch follows.)
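   In schema terms, the Phase 2 mapping would boil down to something like this (assuming an ORC release that has the TIMESTAMP_INSTANT type, e.g. 1.7+):

```java
import org.apache.orc.TypeDescription;

public class Phase2SchemaSketch {
  static TypeDescription phase2Schema() {
    return TypeDescription.createStruct()
        // Spark TIMESTAMP_LTZ maps to ORC TIMESTAMP_INSTANT.
        .addField("ltz", TypeDescription.createTimestampInstant())
        // Spark TIMESTAMP_NTZ maps to plain ORC TIMESTAMP, written with
        // useUTCTimestamp=true so no local-timezone shift is baked in.
        .addField("ntz", TypeDescription.createTimestamp());
  }
}
```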
   
   also cc @gengliangwang @MaxGekk 

