[GitHub] [arrow] wgtmac commented on a diff in pull request #34591: GH-34590: [C++][ORC] Fix timestamp type mapping between orc and arrow

via GitHub Sun, 19 Mar 2023 23:02:53 -0700


wgtmac commented on code in PR #34591:
URL: https://github.com/apache/arrow/pull/34591#discussion_r1141659988



##########
cpp/src/arrow/adapters/orc/util.cc:
##########
@@ -1111,7 +1120,11 @@ Result<std::shared_ptr<DataType>> GetArrowType(const liborc::Type* type) {
     case liborc::CHAR:
       return fixed_size_binary(static_cast<int>(type->getMaximumLength()));
     case liborc::TIMESTAMP:
+      // The timestamp values stored in ORC are in the writer timezone.
       return timestamp(TimeUnit::NANO);

Review Comment:
   Yes, you're right.
   
   For the `orc::TIMESTAMP` type:
   - The ORC writer expects input data (i.e. in the `orc::TimestampVectorBatch`) to be in the "UTC" timezone and serializes it into the writer timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1717
   - The ORC reader deserializes the data from the writer timezone and restores it into the reader timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L336
   
   For the `orc::TIMESTAMP_INSTANT` type:
   - The ORC writer expects input data to be in the "UTC" timezone and serializes it into the "UTC" timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1644
   - The ORC reader deserializes the data from the "UTC" timezone, and no further conversion is needed because writerTimezone and readerTimezone are both "UTC": https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L282
   
   We have seen many issues with the `orc::TIMESTAMP` type because of the 
writer-reader timezone conversion, especially when daylight saving rules 
differ. That is why the `orc::TIMESTAMP_INSTANT` type was added, and it is 
always preferred over `orc::TIMESTAMP` if the user can take care of the timezone.
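
The difference between the two types can be sketched with a toy model. This is not ORC code: `FixedZone`, `SerializeInZone`, `RestoreInZone`, and `RoundTrip` are hypothetical names, the zones are plain fixed offsets with no daylight saving rules, and the sign convention is an assumption chosen for illustration. It only shows why a value round-trips unchanged when writer and reader zones match (always the case for `orc::TIMESTAMP_INSTANT`, where both are "UTC") and shifts when they differ.

```cpp
#include <cstdint>

// Hypothetical fixed-offset zone (seconds east of UTC). Real zones carry
// daylight saving rules, which this sketch deliberately ignores.
struct FixedZone {
  int64_t utc_offset_seconds;
};

// Writer side of the toy model: take a UTC value and store it as
// wall-clock seconds in the given zone.
constexpr int64_t SerializeInZone(int64_t utc_seconds, FixedZone writer) {
  return utc_seconds + writer.utc_offset_seconds;
}

// Reader side of the toy model: interpret the stored wall-clock seconds
// in the given zone and return the corresponding UTC value.
constexpr int64_t RestoreInZone(int64_t stored_seconds, FixedZone reader) {
  return stored_seconds - reader.utc_offset_seconds;
}

// Write with one zone, read back with another.
constexpr int64_t RoundTrip(int64_t utc_seconds, FixedZone writer,
                            FixedZone reader) {
  return RestoreInZone(SerializeInZone(utc_seconds, writer), reader);
}

constexpr FixedZone kUtc{0};
constexpr FixedZone kWriter{8 * 3600};   // assumed writer zone, UTC+8
constexpr FixedZone kReader{-7 * 3600};  // assumed reader zone, UTC-7

// TIMESTAMP_INSTANT-like case: both sides are UTC, so the value is unchanged.
static_assert(RoundTrip(0, kUtc, kUtc) == 0, "UTC to UTC is the identity");

// TIMESTAMP-like case with matching zones: also unchanged.
static_assert(RoundTrip(0, kWriter, kWriter) == 0, "same zone round-trips");

// TIMESTAMP-like case with different zones: the value shifts by the
// difference between the two offsets (here 15 hours).
static_assert(RoundTrip(0, kWriter, kReader) == 15 * 3600,
              "different zones shift the value");
```

With real zones the two offsets also depend on the instant being converted, so mismatched daylight saving rules can shift some values and not others.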



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

