cloud-fan commented on pull request #34712:
URL: https://github.com/apache/spark/pull/34712#issuecomment-981402675


   > The check for isOldOrcFile works fine, but only one way (reading old Spark 
from new Spark). The other way it does not work
   
   You are right, but I don't think we can fix this bug (for TIMESTAMP LTZ) without breaking forward compatibility.
   
   IIUC, the Spark ORC reader/writer has a long-standing bug, and files written by old Spark versions can contain incorrect timestamp values. The Spark ORC writer shifts timestamp values by the JVM local timezone and writes out wrong data; the Spark ORC reader shifts the values back, so the result is still correct as long as the reader and writer use the same JVM local timezone.
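   
   To make the round trip concrete, here is a minimal sketch of the shifting described above, written with plain java.time for illustration (this is not the actual Spark internal code):
   
   ```scala
   import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

   // Writer: re-interpret the instant's local wall-clock fields as UTC.
   // The stored value is wrong unless the JVM timezone is UTC.
   def shiftOnWrite(instant: Instant, jvmZone: ZoneId): Instant =
     LocalDateTime.ofInstant(instant, jvmZone).toInstant(ZoneOffset.UTC)

   // Reader: apply the inverse shift. The original instant comes back only
   // when the reader's JVM timezone matches the writer's.
   def shiftOnRead(stored: Instant, jvmZone: ZoneId): Instant =
     LocalDateTime.ofInstant(stored, ZoneOffset.UTC).atZone(jvmZone).toInstant

   val writerZone = ZoneId.of("America/Los_Angeles")
   val original   = Instant.parse("2021-11-29T10:00:00Z")
   val stored     = shiftOnWrite(original, writerZone)       // wrong micros on disk
   assert(shiftOnRead(stored, writerZone) == original)       // same zone: correct
   assert(shiftOnRead(stored, ZoneId.of("UTC")) != original) // different zone: wrong
   ```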
   
   If we fix this bug in Spark 3.3 and stop shifting the timestamp values, then old Spark versions will always return wrong data when their JVM timezone is not UTC (with UTC the shift is a no-op). New Spark versions, on the other hand, can recognize legacy ORC files and keep shifting timestamp values when reading them.
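   
   Reader-side, that recognition could look roughly like the following sketch, reusing `shiftOnRead` from above. `isLegacySparkOrcFile` stands in for a check like the `isOldOrcFile` mentioned in the quote; both the flag and the function here are hypothetical, not real Spark APIs:
   
   ```scala
   import java.time.{Instant, ZoneId}

   // Hypothetical dispatch in a fixed reader: undo the legacy shift only for
   // files written by old Spark versions; fixed writers store correct values.
   def readLtzMicros(storedMicros: Long, isLegacySparkOrcFile: Boolean,
       jvmZone: ZoneId): Instant = {
     val stored = Instant.EPOCH.plusNanos(storedMicros * 1000L)
     if (isLegacySparkOrcFile) shiftOnRead(stored, jvmZone) else stored
   }
   ```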
   
   That said, for forward compatibility (write with a new version, read with an old version), we must keep writing out "wrong timestamp values". Given that most users just stay in their local timezone for both reads and writes, not fixing this bug may be the better option. We must then do the shifting manually when reading/writing TIMESTAMP NTZ.
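   
   A minimal sketch of that manual shifting for TIMESTAMP NTZ, again reusing the helpers above (the function names are illustrative): since the ORC path keeps its legacy local-timezone shift, NTZ values need the inverse shift applied before write and after read, so the two shifts cancel and the wall-clock value is stored unchanged.
   
   ```scala
   import java.time.{Instant, LocalDateTime, ZoneId, ZoneOffset}

   // Pre-shift on write: after the legacy writer applies shiftOnWrite, the
   // stored micros equal the NTZ wall-clock value encoded as UTC, unshifted.
   def ntzToWriterInput(ntz: LocalDateTime, jvmZone: ZoneId): Instant =
     shiftOnRead(ntz.toInstant(ZoneOffset.UTC), jvmZone)

   // Post-shift on read: undo the legacy reader's shiftOnRead, then decode
   // the stored micros back into the original wall-clock value.
   def readerOutputToNtz(read: Instant, jvmZone: ZoneId): LocalDateTime =
     LocalDateTime.ofInstant(shiftOnWrite(read, jvmZone), ZoneOffset.UTC)
   ```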




