cloud-fan commented on pull request #34712: URL: https://github.com/apache/spark/pull/34712#issuecomment-981402675
> The check for isOldOrcFile works fine, but only one way (reading old Spark files from new Spark). The other way it does not work.

You are right, but I don't think we can fix this bug (for TIMESTAMP_LTZ) without breaking forward compatibility. IIUC, the Spark ORC reader/writer has a long-standing bug, and files written by old Spark versions can be incorrect: the Spark ORC writer shifts the timestamp values using the JVM local timezone and writes out wrong data, and the Spark ORC reader shifts the timestamp values back, so the result is still correct as long as the reader and the writer use the same JVM local timezone.

If we fix this bug in Spark 3.3 and stop shifting the timestamp values, then old Spark versions will always return wrong data whenever their timezone is not UTC (with UTC there is no shifting). On the other hand, new Spark versions can recognize legacy ORC files and still shift the timestamp values. That said, for forward compatibility (write with a new version, read with an old version), we would have to keep writing out "wrong" timestamp values. Given that most users stay in their local timezone for both read and write, not fixing this bug may be the better option. Then we must do the manual shifting ourselves when reading/writing TIMESTAMP_NTZ.
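
To make the shift-on-write / shift-on-read behavior concrete, here is a minimal, self-contained Scala sketch. It is not the actual Spark/ORC code path; `writeShift`, `readShift`, and the exact shift direction are hypothetical, and the point is only that the round trip cancels out exactly when the reader and the writer run with the same JVM local timezone.

```scala
import java.util.TimeZone

// Hypothetical illustration (not Spark's real ORC code path) of the
// legacy behavior described above: the writer shifts the instant by the
// JVM local timezone offset, and the reader shifts it back.
object TimestampShiftSketch {
  // Writer side: shift the UTC instant (millis since epoch) by the local offset.
  def writeShift(utcMillis: Long, tz: TimeZone): Long =
    utcMillis + tz.getOffset(utcMillis)

  // Reader side: undo the shift using the reader's JVM local timezone.
  def readShift(storedMillis: Long, tz: TimeZone): Long =
    storedMillis - tz.getOffset(storedMillis)

  def main(args: Array[String]): Unit = {
    val original = 1000000000000L // 2001-09-09T01:46:40Z

    val writerTz = TimeZone.getTimeZone("America/Los_Angeles")
    val sameTzReader = TimeZone.getTimeZone("America/Los_Angeles")
    val otherTzReader = TimeZone.getTimeZone("Asia/Shanghai")

    // The value on disk is "wrong" (shifted by the writer's local offset).
    val stored = writeShift(original, writerTz)

    // Round trip is correct only when reader and writer share the JVM timezone.
    assert(readShift(stored, sameTzReader) == original)
    assert(readShift(stored, otherTzReader) != original)
  }
}
```

This is also why UTC is the one timezone where old and new behavior agree: the offset is zero, so shifting and not shifting produce the same bytes on disk.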
