[
https://issues.apache.org/jira/browse/HIVE-29033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17986435#comment-17986435
]
Vlad Rozov commented on HIVE-29033:
-----------------------------------
[~zabetak] Please check
{{org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.nextTimestamp}}. The method
converts a {{TimestampColumnVector}} {{row}} to a {{TimestampWritableV2}}. The
conversion requires knowing the time zone of the
{{TimestampColumnVector}}, but Hive blindly assumes it is the UTC
time zone. This works within Hive because Hive uses UTC for all of its operations
and initializes {{org.apache.orc.impl.ReaderImpl}} with an
{{org.apache.hadoop.hive.ql.io.orc.OrcFile.ReaderOptions}} that always sets
{{useUTCTimestamp}} to {{true}}. So far so good. The problem arises when Spark tries to
integrate with Hive 4.x (I am working on a PR that upgrades Spark's dependency on
Hive from 2.3.10 to 4.x). Both Spark and Hive 2.3.10 use the local time zone, so
the blind UTC conversion in Hive 4.x causes incorrect results (a regression) in
Spark after the upgrade.
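To illustrate the mismatch described above: the same stored epoch value yields different wall-clock timestamps depending on which zone the reader assumes. The sketch below is illustrative only (the class and helper names are hypothetical, not Hive code); it models the UTC assumption ({{useUTCTimestamp=true}}) versus a local-zone interpretation using plain {{java.time}}, with Asia/Tokyo standing in for an arbitrary local zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class TimestampZoneDemo {
    // Render an epoch-millis value as a wall-clock timestamp in UTC,
    // modeling the useUTCTimestamp=true path in Hive 4.x.
    static LocalDateTime asUtc(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    }

    // Render the same value in an explicit zone, modeling a reader
    // (Spark / Hive 2.3.10) that works in the local time zone.
    static LocalDateTime asZone(long epochMillis, ZoneId zone) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), zone);
    }

    public static void main(String[] args) {
        long epochMillis = 0L; // 1970-01-01T00:00:00Z
        // Same stored value, two different wall-clock results:
        System.out.println(asUtc(epochMillis));                           // 1970-01-01T00:00
        System.out.println(asZone(epochMillis, ZoneId.of("Asia/Tokyo"))); // 1970-01-01T09:00
    }
}
```

Any fixed-offset difference between the assumed and actual zones shifts every decoded timestamp by that offset, which is the regression Spark observes after the upgrade.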
> ORC reader should not assume that TimestampColumnVector is in UTC time zone
> ---------------------------------------------------------------------------
>
> Key: HIVE-29033
> URL: https://issues.apache.org/jira/browse/HIVE-29033
> Project: Hive
> Issue Type: Bug
> Components: Hive, ORC
> Affects Versions: 4.1.0
> Reporter: Vlad Rozov
> Assignee: Vlad Rozov
> Priority: Major
> Labels: pull-request-available
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)