beliefer opened a new pull request #34769: URL: https://github.com/apache/spark/pull/34769
### What changes were proposed in this pull request?
This PR fixes the issue reported at https://github.com/apache/spark/pull/33588#issuecomment-978719988.

The root cause is that ORC writes and reads timestamps with the local time zone by default, and the local time zone can differ between write and read. If the ORC writer writes a timestamp using one local time zone (e.g. America/Los_Angeles) and the ORC reader reads it using another (e.g. Europe/Amsterdam), the timestamp value comes back different. If the writer writes the timestamp with the UTC time zone and the reader also reads it with UTC, the value round-trips correctly. This PR makes ORC write and read timestamps with the UTC time zone by calling `useUTCTimestamp(true)` on readers and writers.

The related ORC source:
https://github.com/apache/orc/blob/3f1e57cf1cebe58027c1bd48c09eef4e9717a9e3/java/core/src/java/org/apache/orc/impl/WriterImpl.java#L525
https://github.com/apache/orc/blob/1f68ac0c7f2ae804b374500dcf1b4d7abe30ffeb/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java#L1184

Another problem is Spark 3.3 or newer reading ORC files written by Spark 3.2 or earlier. Because those older Spark versions wrote timestamps with the local time zone, such files must not be read with the UTC time zone; otherwise the timestamp values come back incorrect.

### Why are the changes needed?
Fix the timestamp bug for ORC.

### Does this PR introduce _any_ user-facing change?
No. ORC TIMESTAMP_NTZ is a new feature that has not been released yet.

### How was this patch tested?
New tests.
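The time-zone drift described above can be reproduced with plain `java.time`, with no ORC involved. The class and the `writeAsEpoch`/`readFromEpoch` helpers below are hypothetical stand-ins for how a writer and reader interpret a zone-less (NTZ) wall-clock value; they are not ORC APIs:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class NtzRoundTrip {
    // Hypothetical: a writer stores a zone-less wall-clock value as an epoch
    // instant by interpreting it in the writer's time zone.
    static long writeAsEpoch(LocalDateTime wallClock, ZoneId writerZone) {
        return wallClock.atZone(writerZone).toInstant().getEpochSecond();
    }

    // Hypothetical: a reader turns the stored instant back into a wall-clock
    // value by interpreting it in the reader's time zone.
    static LocalDateTime readFromEpoch(long epochSecond, ZoneId readerZone) {
        return ZonedDateTime
            .ofInstant(Instant.ofEpochSecond(epochSecond), readerZone)
            .toLocalDateTime();
    }

    public static void main(String[] args) {
        LocalDateTime ntz = LocalDateTime.of(2021, 6, 1, 12, 0, 0);
        ZoneId la  = ZoneId.of("America/Los_Angeles");
        ZoneId ams = ZoneId.of("Europe/Amsterdam");
        ZoneId utc = ZoneId.of("UTC");

        // Writer and reader use different local zones: the wall clock drifts
        // by the offset difference (PDT is UTC-7, CEST is UTC+2).
        LocalDateTime drifted = readFromEpoch(writeAsEpoch(ntz, la), ams);
        System.out.println(drifted); // 2021-06-01T21:00

        // Both sides pin UTC (the effect of useUTCTimestamp(true) on both
        // writer and reader): the wall clock round-trips unchanged.
        LocalDateTime stable = readFromEpoch(writeAsEpoch(ntz, utc), utc);
        System.out.println(stable.equals(ntz)); // true
    }
}
```

This also illustrates the compatibility caveat: a file whose timestamps were stored via the writer's local zone (old Spark) must be decoded with that same convention, not with UTC.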
