[
https://issues.apache.org/jira/browse/HIVE-26270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stamatis Zampetakis updated HIVE-26270:
---------------------------------------
Labels: compatibility timestamp (was: )
> Wrong timestamps when reading Hive 3.1.x Parquet files with vectorized reader
> -----------------------------------------------------------------------------
>
> Key: HIVE-26270
> URL: https://issues.apache.org/jira/browse/HIVE-26270
> Project: Hive
> Issue Type: Bug
> Components: HiveServer2, Parquet
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: compatibility, timestamp
>
> Parquet files written in Hive 3.1.x onwards with timezone set to US/Pacific.
> {code:sql}
> CREATE TABLE employee (eid INT, birth timestamp) STORED AS PARQUET;
> INSERT INTO employee VALUES
> (1, '1880-01-01 00:00:00'),
> (2, '1884-01-01 00:00:00'),
> (3, '1990-01-01 00:00:00');
> {code}
> Parquet files read with Hive 4.0.0-alpha-1 onwards.
> +Without vectorization+ results are correct.
> {code:sql}
> SELECT * FROM employee;
> {code}
> {noformat}
> 1 1880-01-01 00:00:00
> 2 1884-01-01 00:00:00
> 3 1990-01-01 00:00:00
> {noformat}
> +With vectorization+ some timestamps are shifted.
> {code:sql}
> -- Disable fetch task conversion to force vectorization to kick in
> set hive.fetch.task.conversion=none;
> SELECT * FROM employee;
> {code}
> {noformat}
> 1 1879-12-31 23:52:58
> 2 1884-01-01 00:00:00
> 3 1990-01-01 00:00:00
> {noformat}
> The problem is the same as the one reported under HIVE-24074. The data were
> written using the new Date/Time APIs (java.time) in Hive 3.1.3 and are read
> here using the old APIs (java.sql).
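The 7 minute 2 second shift in row 1 can be reproduced with java.time alone. The sketch below is only an illustration, not Hive code: the class and method names are hypothetical, and it assumes that the legacy read path effectively applies the standard -08:00 offset for that date. That assumption is chosen because it reproduces the reported value; the issue itself does not state it.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration class, not part of Hive.
class LmtShiftSketch {
    // Offset that java.time (tz database rules, including local mean time)
    // reports for a wall-clock value in the given zone.
    static ZoneOffset offsetAt(String zone, LocalDateTime local) {
        return ZoneId.of(zone).getRules().getOffset(local);
    }

    // Writer side: local value -> instant using java.time zone rules.
    // Reader side: instant -> local value using an ASSUMED fixed offset,
    // standing in for the legacy conversion.
    static LocalDateTime readWithFixedOffset(String zone, LocalDateTime written,
                                             ZoneOffset assumedReaderOffset) {
        Instant stored = written.atZone(ZoneId.of(zone)).toInstant();
        return LocalDateTime.ofInstant(stored, assumedReaderOffset);
    }

    public static void main(String[] args) {
        ZoneOffset assumedLegacyOffset = ZoneOffset.ofHours(-8); // assumption
        for (int year : new int[] {1880, 1884, 1990}) {
            LocalDateTime birth = LocalDateTime.of(year, 1, 1, 0, 0, 0);
            System.out.println(birth
                + " written with offset " + offsetAt("US/Pacific", birth)
                + ", read back as "
                + readWithFixedOffset("US/Pacific", birth, assumedLegacyOffset));
        }
    }
}
```

Under that assumption only the 1880 value moves (to 1879-12-31 23:52:58), because US/Pacific used a local mean time offset of -07:52:58 before late 1883; the 1884 and 1990 values already fall under the -08:00 standard offset and come back unchanged.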
> The difference from HIVE-24074 is that here the problem appears only with
> vectorized execution, while the non-vectorized reader works fine, so there
> is an *inconsistency in the behavior* of the vectorized and non-vectorized
> readers.
> The non-vectorized reader works fine because it automatically infers that it
> should use the new JDK APIs to read back the timestamp value. This is
> possible in this case because the file contains metadata (i.e., the presence
> of {{writer.time.zone}}) from which it can infer that the timestamps were
> written using the new Date/Time APIs.
> The inconsistent behavior between vectorized and non-vectorized reader is a
> regression caused by HIVE-25104. This JIRA is an attempt to re-align the
> behavior between vectorized and non-vectorized readers.
> Note that if the file metadata are empty, neither the vectorized nor the
> non-vectorized reader can determine which APIs to use for the conversion. In
> this case the user must set
> {{hive.parquet.timestamp.legacy.conversion.enabled}} explicitly to get
> correct results.
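The decision described in the issue (infer the conversion from file metadata when possible, otherwise fall back to the user setting) can be sketched as a single helper that both readers would share. This is a hedged illustration only: the class, method, and parameter names are hypothetical, and Hive's actual logic may consult additional metadata keys.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical illustration class, not part of Hive.
class TimestampConversionSketch {
    // Metadata key named in the issue description.
    static final String WRITER_TIME_ZONE = "writer.time.zone";

    // Returns true when the reader should use the legacy (old API) conversion.
    //   fileMetadata            key/value metadata from the Parquet file (may be empty)
    //   legacyConversionEnabled value of hive.parquet.timestamp.legacy.conversion.enabled
    static boolean useLegacyConversion(Map<String, String> fileMetadata,
                                       boolean legacyConversionEnabled) {
        if (fileMetadata != null && fileMetadata.containsKey(WRITER_TIME_ZONE)) {
            // The writer recorded its time zone, so the timestamps were
            // written with the new Date/Time APIs: read them back the same way.
            return false;
        }
        // No hint in the file: only the user setting can decide.
        return legacyConversionEnabled;
    }

    public static void main(String[] args) {
        Map<String, String> withZone =
            Collections.singletonMap(WRITER_TIME_ZONE, "US/Pacific");
        Map<String, String> empty = Collections.emptyMap();
        System.out.println(useLegacyConversion(withZone, true));
        System.out.println(useLegacyConversion(empty, true));
        System.out.println(useLegacyConversion(empty, false));
    }
}
```

Routing the vectorized and non-vectorized paths through one such check is one way to keep their behavior aligned; with the metadata present, the user setting is ignored in this sketch.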
--
This message was sent by Atlassian Jira
(v8.20.7#820007)