wgtmac commented on code in PR #34591:
URL: https://github.com/apache/arrow/pull/34591#discussion_r1141659988
##########
cpp/src/arrow/adapters/orc/util.cc:
##########
@@ -1111,7 +1120,11 @@ Result<std::shared_ptr<DataType>> GetArrowType(const liborc::Type* type) {
case liborc::CHAR:
return fixed_size_binary(static_cast<int>(type->getMaximumLength()));
case liborc::TIMESTAMP:
+ // The timestamp values stored in ORC are in the writer timezone.
return timestamp(TimeUnit::NANO);
Review Comment:
Yes, you're right.
For `orc::TIMESTAMP` type:
- The ORC writer expects input data (i.e. the values in the
`orc::TimestampVectorBatch`) to be in the "UTC" timezone and serializes it
into the writer timezone:
https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1717
- The ORC reader deserializes the data from the writer timezone and converts
it to the reader timezone:
https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L336
For `orc::TIMESTAMP_INSTANT` type:
- The ORC writer expects input data to be in the "UTC" timezone and
serializes it in the "UTC" timezone:
https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1644
- The ORC reader deserializes the data from the "UTC" timezone, and no further
conversion is needed because writerTimezone and readerTimezone are both "UTC":
https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L282
We have seen many issues around the `orc::TIMESTAMP` type because of the
writer-reader timezone conversion, especially when the two timezones have
different daylight saving rules. That is why the `orc::TIMESTAMP_INSTANT` type
was added, and it is always preferred over `orc::TIMESTAMP` if the user can
take care of the timezone themselves.
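
To make the difference concrete, here is a rough sketch of the two round trips. It is a toy model, not the ORC implementation: `FixedZone`, `WriteTimestamp`, `ReadTimestamp` and the `*Instant` helpers are hypothetical names, a timezone is reduced to a fixed offset from UTC in seconds, and the sign conventions are my assumption about how the shifts compose.

```cpp
#include <cstdint>

// Toy model only: a "timezone" is a fixed UTC offset in seconds.
// Real timezones have daylight saving transitions, which this ignores.
struct FixedZone {
  int64_t utc_offset_seconds;
};

constexpr FixedZone kUtc{0};

// orc::TIMESTAMP, writer side (simplified): the input is treated as a
// wall-clock value in "UTC" and shifted into the writer timezone.
constexpr int64_t WriteTimestamp(int64_t input_seconds, FixedZone writer) {
  return input_seconds - writer.utc_offset_seconds;
}

// orc::TIMESTAMP, reader side (simplified): undo the writer shift, then
// apply the reader timezone shift.
constexpr int64_t ReadTimestamp(int64_t stored_seconds, FixedZone writer,
                                FixedZone reader) {
  return stored_seconds + writer.utc_offset_seconds - reader.utc_offset_seconds;
}

// orc::TIMESTAMP_INSTANT (simplified): writer and reader are both "UTC",
// so neither side shifts the value.
constexpr int64_t WriteTimestampInstant(int64_t input_seconds) {
  return WriteTimestamp(input_seconds, kUtc);
}

constexpr int64_t ReadTimestampInstant(int64_t stored_seconds) {
  return ReadTimestamp(stored_seconds, kUtc, kUtc);
}
```

In this model a `TIMESTAMP` value only comes back unchanged when the reader timezone is "UTC"; any other reader timezone shifts the result by its offset, while `TIMESTAMP_INSTANT` is always the identity. With real timezones the two offsets can also be looked up at different instants around a DST transition, which the fixed offsets here deliberately leave out.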