[ 
https://issues.apache.org/jira/browse/IMPALA-13627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Smith resolved IMPALA-13627.
------------------------------------
    Fix Version/s: Impala 4.6.0
       Resolution: Fixed

> Impala uses different timezone conversion when reading Hive with legacy 
> conversion
> ----------------------------------------------------------------------------------
>
>                 Key: IMPALA-13627
>                 URL: https://issues.apache.org/jira/browse/IMPALA-13627
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.4.1
>            Reporter: Michael Smith
>            Assignee: Michael Smith
>            Priority: Critical
>              Labels: avro, parquet
>             Fix For: Impala 4.6.0
>
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
> performed and to some extend how timestamps are serialized and deserialized 
> in files (Parquet, Avro) by Hive. HIVE-25104 was added to allow Hive to 
> continue to write files using the legacy timestamp conversion so that older 
> Hive versions can read the correct time.
> All of that is background to say that Impala - when converting UTC timestamps 
> with {{convert_legacy_hive_parquet_utc_timestamps}} - does not mirror Hive's 
> timestamp conversion when reading INT96 coded timestamps that Hive converted 
> from local time to UTC using legacy timezone conversion. To reproduce
> # Start Hive with TZ=Asia/Kuala_Lumpur
> # Using beeline
> {code}
> create table test (d timestamp) stored as parquet;
> set hive.parquet.timestamp.write.legacy.conversion.enabled=true;
> insert into test values (cast("1900-01-01 00:00:00" as timestamp));
> select * from test;
> {code}
> # Run {{impala-shell.sh -Q timezone=Asia/Kuala_Lumpur -Q 
> convert_legacy_hive_parquet_utc_timestamps=true -q 'select * from test'}}
> In this particular example, Asia/Kuala_Lumpur will either map to (LMT) or 
> (SGT) depending on tzdata version. In either case, that time zone for 1900 
> differs from the current UTC+8 timezone shift, so Impala shows a value that's 
> off by ~1 hour.
> The Parquet file Hive writes in this case contains 
> [key_value_metadata|https://parquet.apache.org/docs/file-format/metadata/] 
> such that Hive can identify what conversion to use when reading the data, so 
> newer Hive always handles these files correctly
> {code}
> writer.time.zone=Asia/Kuala_Lumpur
> writer.model.name=3.1.3000.7.1.7.1000-141
> writer.date.proleptic=false
> writer.zone.conversion.legacy=true
> {code}
> Impala could support the same behavior to be compatible with Hive by 
> identifying the {{writer.zone.conversion.legacy}} flag and handing conversion 
> to a SimpleDateFormat in Java ([Hive 
> code|https://github.com/apache/hive/blob/rel/release-4.0.1/common/src/java/org/apache/hadoop/hive/common/type/TimestampTZUtil.java#L194-L201]).
> Impala also behaves differently with older files where 
> {{writer.zone.conversion.legacy}} is not set. Hive's behavior is controlled 
> by {{hive.parquet.timestamp.legacy.conversion.enabled}}, but Impala doesn't 
> have a mode that uses SimpleDateFormat conversion. Impala's concept of 
> "legacy" conversion is whether to assume UTC and convert to local timezone at 
> all. It always uses tzdata for that conversion.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to