Jefffrey commented on code in PR #1867:
URL: https://github.com/apache/orc/pull/1867#discussion_r1545518089

##########
site/specification/ORCv1.md:
##########
@@ -1155,6 +1155,9 @@ records non-null values, a DATA stream that records the number of
 seconds after 1 January 2015, and a SECONDARY stream that records the
 number of nanoseconds.
 
+* Note that if writer timezone is set, 1 January 2015 is according to
+this timezone and not according to UTC

Review Comment:
   According to implementation:

   https://github.com/apache/orc/blob/3b5b2a6286df48a4ab471aece74bc7b7947042ad/c%2B%2B/src/Timezone.hh#L66-L72

   https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L335

   Also, I'm not certain if the writer timezone is mandatory in the stripe if a TIMESTAMP column is present in the file. Might need some clarification on this?


##########
site/specification/ORCv1.md:
##########
@@ -1170,6 +1173,35 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
  | DATA | No | Signed Integer RLE v2
  | SECONDARY | No | Unsigned Integer RLE v2
 
+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.

Review Comment:
   According to https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L350-L352


##########
site/specification/ORCv1.md:
##########
@@ -1170,6 +1173,35 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
  | DATA | No | Signed Integer RLE v2
  | SECONDARY | No | Unsigned Integer RLE v2
 
+Due to ORC-763, values before the UNIX epoch which have nanoseconds greater
+than 999,999 are adjusted to have 1 second less.
+
+For example, given a stripe with a TIMESTAMP column with a writer timezone
+of US/Pacific, and a reader timezone of UTC, we have the decoded integer values
+of -1,440,851,103 from the DATA stream and 199,900,000 from the SECONDARY stream.
+
+First we must adjust the DATA value to be relative to the UNIX epoch. The ORC
+epoch is 1 January 2015 00:00:00 US/Pacific, since we must take into account the writer
+timezone. This translates to 1 January 2015 08:00:00 UTC, as US/Pacific is equivalent
+to a -08:00 offset from UTC at that date (no daylight savings). The number of seconds
+from 1 January 1970 00:00:00 UTC to 1 January 2015 08:00:00 UTC is 1,420,099,200. This is
+added to the DATA value to produce a value of -20,751,903. As this is before the
+UNIX epoch (since it is negative), and the SECONDARY value, 199,900,000, is
+greater than 999,999, then this DATA value is adjusted to become -20,751,904
+(1 second subtracted).
+
+This value by itself represents 5 May 1969 19:34:56.1999, which now needs to be adjusted
+from US/Pacific (the writer's timezone) to UTC (the reader's timezone). As the value is
+within daylight savings for US/Pacific, 7 hours are subtracted to give the final value
+of 5 May 1969 12:34:56.1999.

Review Comment:
   I'm kinda unclear on how this exactly works, I'm just going off the C++ implementation here: https://github.com/apache/orc/blob/9b79de995430b240cb68f225c22bef69ebbd8b07/c%2B%2B/src/ColumnReader.cc#L335-L348

   Happy to be corrected if my understanding or wording is inaccurate anywhere :+1:
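
As a sanity check on the arithmetic in the worked example above, here is a minimal Python sketch that replays the steps exactly as the proposed text words them. It is not the ORC reader and not a definitive reading of the spec. The helper name `adjust_seconds` and the constants are made up for illustration, and the UTC offsets (-08:00 on 1 January 2015, 7 hours on 5 May 1969) are hard-coded from the example rather than looked up in a timezone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Pacific is at -08:00 on 1 January 2015 (no daylight savings),
# as stated in the example. Hard-coded instead of using a tz database.
PACIFIC_STANDARD = timezone(timedelta(hours=-8))

# ORC epoch (1 January 2015 00:00:00 in the writer timezone) expressed as
# seconds since the UNIX epoch.
ORC_EPOCH = int(datetime(2015, 1, 1, tzinfo=PACIFIC_STANDARD).timestamp())


def adjust_seconds(data: int, secondary: int) -> int:
    """Hypothetical helper: shift DATA to the UNIX epoch, then apply ORC-763."""
    seconds = data + ORC_EPOCH
    # ORC-763: values before the UNIX epoch with nanoseconds greater than
    # 999,999 are adjusted to have 1 second less.
    if seconds < 0 and secondary > 999_999:
        seconds -= 1
    return seconds


data, secondary = -1_440_851_103, 199_900_000
seconds = adjust_seconds(data, secondary)

# Render the adjusted seconds plus the nanoseconds as a wall-clock value.
wall = datetime(1970, 1, 1) + timedelta(
    seconds=seconds, microseconds=secondary // 1000
)

# Last step as worded in the proposed text: 7 hours subtracted because the
# value falls within daylight savings for US/Pacific.
final = wall - timedelta(hours=7)

print(ORC_EPOCH)  # 1420099200
print(seconds)    # -20751904
print(wall)       # 1969-05-05 19:34:56.199900
print(final)      # 1969-05-05 12:34:56.199900
```

The printed values match the numbers in the example, so the arithmetic is at least internally consistent. Whether subtracting 7 hours is the right direction for the writer-to-reader adjustment is the part that still needs confirming against the C++ reader.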
