Re: [PR] ORC-1671: update timestamp doc in specfication [orc]

via GitHub Sat, 30 Mar 2024 17:49:05 -0700


Jefffrey commented on code in PR #1867:
URL: https://github.com/apache/orc/pull/1867#discussion_r1545518089



##########
site/specification/ORCv1.md:
##########
@@ -1155,6 +1155,9 @@ records non-null values, a DATA stream that records the number of
 seconds after 1 January 2015, and a SECONDARY stream that records the
 number of nanoseconds.
 
+* Note that if writer timezone is set, 1 January 2015 is according to
+this timezone and not according to UTC

Review Comment:
   According to the implementation:
   
   
https://github.com/apache/orc/blob/3b5b2a6286df48a4ab471aece74bc7b7947042ad/c%2B%2B/src/Timezone.hh#L66-L72
   
   
https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L335
   
   Also, I'm not certain whether the writer timezone is mandatory in the stripe when a TIMESTAMP column is present in the file. This might need some clarification.



##########
site/specification/ORCv1.md:
##########
@@ -1170,6 +1173,35 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
               | DATA            | No       | Signed Integer RLE v2
               | SECONDARY       | No       | Unsigned Integer RLE v2
 
+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.

Review Comment:
   According to
   
   
https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L350-L352



##########
site/specification/ORCv1.md:
##########
@@ -1170,6 +1173,35 @@ DIRECT_V2     | PRESENT         | Yes      | Boolean RLE
               | DATA            | No       | Signed Integer RLE v2
               | SECONDARY       | No       | Unsigned Integer RLE v2
 
+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.
+
+For example, given a stripe with a TIMESTAMP column with a writer timezone
+of US/Pacific, and a reader timezone of UTC, we have the decoded integer values
+of -1,440,851,103 from the DATA stream and 199,900,000 from the SECONDARY stream.
+
+First we must adjust the DATA value to be relative to the UNIX epoch. The ORC
+epoch is 1 January 2015 00:00:00 US/Pacific, since we must take into account the writer
+timezone. This translates to 1 January 2015 08:00:00 UTC, as US/Pacific is equivalent
+to a -08:00 offset from UTC at that date (no daylight savings). The number of seconds
+from 1 January 1970 00:00:00 UTC to 1 January 2015 08:00:00 UTC is 1,420,099,200. This is
+added to the DATA value to produce a value of -20,751,903. As this is before the
+UNIX epoch (since it is negative), and the SECONDARY value, 199,900,000, is
+greater than 999,999, then this DATA value is adjusted to become -20,751,904
+(1 second subtracted).
+
+This value by itself represents 5 May 1969 19:34:56.1999, which now needs to be adjusted
+from US/Pacific (the writer's timezone) to UTC (the reader's timezone). As the value is
+within daylight savings for US/Pacific, 7 hours are subtracted to give the final value
+of 5 May 1969 12:34:56.1999.

Review Comment:
   I'm kinda unclear on exactly how this works; I'm just going off the C++ implementation here:
   
   
https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L335-L348
   
   Happy to be corrected if my understanding or wording is inaccurate anywhere :+1:
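
   To make the arithmetic in the example easier to check, here is a minimal Python sketch of the steps as described in the suggested text. It is not the C++ implementation: the function name `decode_timestamp` and its parameters are hypothetical, and the UTC offsets (-08:00 at the ORC epoch, -07:00 at the decoded value) are hard-coded from the example rather than looked up in a timezone database.

```python
from datetime import datetime, timedelta, timezone

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_timestamp(data, nanos, writer_epoch_offset,
                     writer_value_offset, reader_value_offset):
    """Hypothetical helper following the steps in the example text.

    data                -- decoded value from the DATA stream (seconds)
    nanos               -- decoded value from the SECONDARY stream (nanoseconds)
    writer_epoch_offset -- writer timezone's UTC offset on 1 January 2015
    writer_value_offset -- writer timezone's UTC offset at the decoded value
    reader_value_offset -- reader timezone's UTC offset at the decoded value
    """
    # ORC epoch: 1 January 2015 00:00:00 in the writer timezone.
    orc_epoch = datetime(2015, 1, 1, tzinfo=timezone(writer_epoch_offset))
    epoch_seconds = int((orc_epoch - UNIX_EPOCH).total_seconds())

    # Make the DATA value relative to the UNIX epoch.
    seconds = data + epoch_seconds

    # ORC-763 adjustment as worded in the suggestion: values before the
    # UNIX epoch with more than 999,999 nanoseconds lose one second.
    if seconds < 0 and nanos > 999_999:
        seconds -= 1

    # Shift from the writer's timezone to the reader's timezone.
    shift = writer_value_offset - reader_value_offset
    seconds += int(shift.total_seconds())
    return seconds, nanos


# Values from the example: writer US/Pacific, reader UTC.
seconds, nanos = decode_timestamp(
    data=-1_440_851_103,
    nanos=199_900_000,
    writer_epoch_offset=timedelta(hours=-8),  # no daylight saving in January
    writer_value_offset=timedelta(hours=-7),  # daylight saving in May 1969
    reader_value_offset=timedelta(0),         # UTC
)
wall_clock = UNIX_EPOCH + timedelta(seconds=seconds, microseconds=nanos // 1000)
print(wall_clock.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 1969-05-05 12:34:56.199900
```

   This reproduces the intermediate values in the example (1,420,099,200 for the epoch offset, -20,751,904 after the one-second adjustment) and the final 5 May 1969 12:34:56.1999. A real reader would derive the offsets from the writer and reader timezone rules instead of taking them as arguments.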



