Jefffrey commented on code in PR #1867: URL: https://github.com/apache/orc/pull/1867#discussion_r1567162308
########## site/_docs/types.md: ########## @@ -69,7 +69,9 @@ create table Foobar ( ORC includes two different forms of timestamps from the SQL world: -* **Timestamp** is a date and time without a time zone, which does not change based on the time zone of the reader. +* **Timestamp** is a date and time without a time zone, where the timestamp value is stored in the writer timezone +encoded at the stripe level, if present. ORC readers will read this value back into the reader's timezone. Usually +both writer and reader timezones default to UTC, however older ORC files may contain non-UTC writer timezones Review Comment: I'm not sure I agree with that, as this timestamp type has been a great source of confusion for me when attempting to implement an ORC to Arrow reader in Rust. For example, in the test file [TestOrcFile.testDate2038.orc](https://github.com/apache/orc/blob/d193d130eb94cd2ee737423cc2f09a00d0035ebb/examples/TestOrcFile.testDate2038.orc), according to PyArrow (which uses the ORC C++ reader underneath), the first value is: ```python >> from pyarrow import orc >>> orc.read_table("/home/jeffrey/Code/orc/examples/TestOrcFile.testDate2038.orc")[0][0] <pyarrow.TimestampScalar: '2038-05-05T12:34:56.100000000'> ``` The underlying integer value for the DATA stream is `736_601_696` seconds (let's ignore nanoseconds). When adding this to 1 Jan 2015 00:00:00, regardless of timezone, we get 5 May 2038 11:34:56. Which seems... surprising. - https://www.timeanddate.com/date/dateadded.html?d1=1&m1=1&y1=2015&type=add&ay=&am=&aw=&ad=&h1=0&i1=0&s1=0&ah=&ai=&as=736601696&rec= This isn't at all what I would expect hence my confusion (perhaps I just don't understand properly how DST factors into this). But diving into the internals and seeing reader & writer timezone offsets being considered when decoding the timestamp value just adds to the confusion, especially as that contradicts the specification. I was hoping to open up some discussion on clarifying the exact definition for this timestamp type as part of this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
