cloud-fan commented on pull request #34741: URL: https://github.com/apache/spark/pull/34741#issuecomment-983660633
@bersprockets After reading more ORC code, I feel the timestamp implementation is quite messy in ORC. Not only the reader side but also the writer side shifts the timestamp value according to the JVM local timezone: https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L112-L113

It seems like the default behavior of the ORC lib is designed for people who want to deal with `java.sql.Timestamp` directly, not for an engine like Spark that treats ORC only as a storage layer. Spark should have set `useUTCTimestamp` to true (see the first sketch at the end of this comment), but it's too late now, as we need to support existing ORC files written by old Spark versions. Fixing this mistake in the storage layer will probably take years of smooth migration. My proposal is:

Phase 1:
1. Write TIMESTAMP_NTZ as ORC int64, with a column property to indicate it's TIMESTAMP_NTZ (writing TIMESTAMP_LTZ should add the column property as well; see the second sketch below).
2. Support reading ORC TIMESTAMP_INSTANT as Spark TIMESTAMP_LTZ.
3. When reading ORC TIMESTAMP, check the column property and only read it as LTZ if the column property says it's LTZ.

Phase 2 (after Spark 3.3 becomes the oldest officially supported version; see the third sketch below):
1. Write LTZ as ORC TIMESTAMP_INSTANT.
2. Write NTZ as ORC TIMESTAMP, with `useUTCTimestamp` set to true.

also cc @gengliangwang @MaxGekk
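To make the `useUTCTimestamp` point concrete, here is a minimal sketch of an engine-side writer setup, assuming ORC 1.6+ where `OrcFile.WriterOptions#useUTCTimestamp` is available (the output path and schema are just placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class UtcTimestampWriterSketch {
  public static void main(String[] args) throws Exception {
    TypeDescription schema = TypeDescription.fromString("struct<ts:timestamp>");

    // useUTCTimestamp(true) keeps timestamp values in UTC instead of letting
    // TimestampTreeWriter shift them by the JVM local timezone (the default
    // behavior linked above).
    Writer writer = OrcFile.createWriter(
        new Path("/tmp/utc-ts.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .useUTCTimestamp(true));
    writer.close();
  }
}
```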
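For Phase 1, the column property could be carried as an ORC type attribute. A minimal sketch, assuming ORC 1.6+ type attributes (`TypeDescription#setAttribute` / `#getAttributeValue`); the attribute key and values here are hypothetical, not a decided naming scheme:

```java
import org.apache.orc.TypeDescription;

public class TimestampColumnPropertySketch {
  // Hypothetical attribute key; the real key would be decided in the
  // actual implementation.
  private static final String SPARK_TYPE_KEY = "spark.timestamp.type";

  public static void main(String[] args) {
    // Writer side: NTZ is stored as int64, LTZ as ORC TIMESTAMP, and both
    // carry a column attribute recording the original Spark type.
    TypeDescription schema = TypeDescription.createStruct()
        .addField("ntz_col",
            TypeDescription.createLong()
                .setAttribute(SPARK_TYPE_KEY, "timestamp_ntz"))
        .addField("ltz_col",
            TypeDescription.createTimestamp()
                .setAttribute(SPARK_TYPE_KEY, "timestamp_ltz"));

    // Reader side: only treat an ORC TIMESTAMP column as LTZ when the
    // attribute explicitly says so.
    TypeDescription ltzCol = schema.getChildren().get(1);
    boolean isLtz =
        "timestamp_ltz".equals(ltzCol.getAttributeValue(SPARK_TYPE_KEY));
    System.out.println("ltz_col is LTZ: " + isLtz);
  }
}
```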
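And for Phase 2, a sketch of the target mapping, assuming the ORC schema-string spelling of TIMESTAMP_INSTANT is `timestamp with local time zone` (as in ORC 1.6+):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class Phase2MappingSketch {
  public static void main(String[] args) throws Exception {
    // LTZ -> ORC TIMESTAMP_INSTANT; NTZ -> plain ORC TIMESTAMP, written with
    // useUTCTimestamp so no JVM-local-timezone shifting happens on write.
    TypeDescription schema = TypeDescription.fromString(
        "struct<ltz_col:timestamp with local time zone,ntz_col:timestamp>");

    Writer writer = OrcFile.createWriter(
        new Path("/tmp/phase2.orc"),
        OrcFile.writerOptions(new Configuration())
            .setSchema(schema)
            .useUTCTimestamp(true));
    writer.close();
  }
}
```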
