[
https://issues.apache.org/jira/browse/IMPALA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Smith resolved IMPALA-13627.
------------------------------------
Fix Version/s: Impala 4.6.0
Resolution: Fixed
> Impala uses different timezone conversion when reading Hive with legacy
> conversion
> ----------------------------------------------------------------------------------
>
> Key: IMPALA-13627
> URL: https://issues.apache.org/jira/browse/IMPALA-13627
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 4.4.1
> Reporter: Michael Smith
> Assignee: Michael Smith
> Priority: Critical
> Labels: avro, parquet
> Fix For: Impala 4.6.0
>
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are
> performed and to some extend how timestamps are serialized and deserialized
> in files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to
> continue to write files using the legacy timestamp conversion so that older
> Hive versions can read the correct time.
> All of that is background to say that Impala - when converting UTC timestamps
> with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's
> timestamp conversion when reading INT96 coded timestamps that Hive converted
> from local time to UTC using legacy timezone conversion. To reproduce
> # Start Hive with TZ=Asia/Kuala_Lumpur
> # Using beeline
> {code}
> create table test (d timestamp) stored as parquet;
> set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
> insert into test values (cast("1900-01-01 00:00:00" as timestamp));
> select * from test;
> {code}
> # Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q
> convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}
> In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or
> (SGT) depending on tzdata version. In either case, that time zone for 1900
> differs from the current UTC+8 timezone shift, so Impala shows a value that's
> off by ~1 hour.
> The Parquet file Hive writes in this case contains
> [key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/]
> such that Hive can identify what conversion to use when reading the data, so
> newer Hive always handles these files correctly
> {code}
> writer.time.zone=Asia/Kuala_Lumpur
> writer.model.name=3.1.3000.7.1.7.1000-141
> writer.date.proleptic=false
> writer.zone.conversion.legacy=true
> {code}
> Impala could support the same behavior to be compatible with Hive by
> identifying the {{writer.zone.conversion.legacy}} flag and handing conversion
> to a SimpleDateFormat in Java ([Hive
> code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
> Impala also behaves differently with older files where
> {{writer.zone.conversion.legacy}} is not set. Hive's behavior is controlled
> by {{hive.parquet.timestamp.legacy.conversion.enabled}}, but Impala doesn't
> have a mode that uses SimpleDateFormat conversion. Impala's concept of
> "legacy" conversion is whether to assume UTC and convert to local timezone at
> all. It always uses tzdata for that conversion.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)