Jefffrey commented on code in PR #1867:
URL: https://github.com/apache/orc/pull/1867#discussion_r1567162308


##########
site/_docs/types.md:
##########
@@ -69,7 +69,9 @@ create table Foobar (
 
 ORC includes two different forms of timestamps from the SQL world:
 
-* **Timestamp** is a date and time without a time zone, which does not change 
based on the time zone of the reader.
+* **Timestamp** is a date and time without a time zone, where the timestamp 
value is stored in the writer timezone
+encoded at the stripe level, if present. ORC readers will read this value back 
into the reader's timezone. Usually
+both writer and reader timezones default to UTC, however older ORC files may 
contain non-UTC writer timezones

Review Comment:
   I'm not sure I agree with that, as this timestamp type has been a great 
source of confusion for me when attempting to implement an ORC to Arrow reader 
in Rust.
   
   For example, in the test file 
[TestOrcFile.testDate2038.orc](https://github.com/apache/orc/blob/d193d130eb94cd2ee737423cc2f09a00d0035ebb/examples/TestOrcFile.testDate2038.orc),
 according to PyArrow (which uses the ORC C++ reader underneath), the first 
value is:
   
   ```python
   >> from pyarrow import orc
   >>> 
orc.read_table("/home/jeffrey/Code/orc/examples/TestOrcFile.testDate2038.orc")[0][0]
   <pyarrow.TimestampScalar: '2038-05-05T12:34:56.100000000'>
   ```
   
   The underlying integer value for the DATA stream is `736_601_696` seconds 
(let's ignore nanoseconds). When adding this to 1 Jan 2015 00:00:00, regardless 
of timezone, we get 5 May 2038 11:34:56. Which seems... surprising.
   
   - 
https://www.timeanddate.com/date/dateadded.html?d1=1&m1=1&y1=2015&type=add&ay=&am=&aw=&ad=&h1=0&i1=0&s1=0&ah=&ai=&as=736601696&rec=
   
   This isn't at all what I would expect hence my confusion (perhaps I just 
don't understand properly how DST factors into this). But diving into the 
internals and seeing reader & writer timezone offsets being considered when 
decoding the timestamp value just adds to the confusion, especially as that 
contradicts the specification.
   
   I was hoping to open up some discussion on clarifying the exact definition 
for this timestamp type as part of this PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to