[
https://issues.apache.org/jira/browse/IMPALA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967675#comment-16967675
]
Attila Jeges commented on IMPALA-3933:
--------------------------------------
[[email protected]] The Java TZ database and the IANA TZ database (used by
the OS) have different binary formats, so making Impala use the Java TZ
database is not a trivial task.
We could package an IANA TZ database that is compatible with the current
version of the Java TZ database and make it publicly available for Impala
users. The problem with this approach is that time zone rules change
frequently, and Java's TZ database gets updated from time to time (whenever
the admin runs a system update), at which point the two would be out of sync
again.
I haven't tested it yet, but there is a tool that converts the IANA TZ
database to Java's TZ database: https://github.com/akashche/tzdbgen . Perhaps
we should point users to this (or a similar) tool if they want to keep the two
databases in sync.
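
Whichever tool is used, users would still need a way to tell when the two
databases have drifted apart. The following is only a minimal sketch of such a
check, under several assumptions: that the OS ships a tzdata.zi file at the
path shown (this varies by distribution), that its first line has the form
"# version 2019c", and that the Java-side version string is obtained
separately. The helper names and the example version value are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location of the OS copy of the IANA data; not present on every system.
TZDATA_ZI = Path("/usr/share/zoneinfo/tzdata.zi")


def parse_version(first_line: str) -> Optional[str]:
    """Return e.g. '2019c' from a '# version 2019c' header line, else None."""
    parts = first_line.strip().split()
    if len(parts) == 3 and parts[:2] == ["#", "version"]:
        return parts[2]
    return None


def os_tz_version(path: Path = TZDATA_ZI) -> Optional[str]:
    """Read the IANA release name from the OS, or None if it cannot be found."""
    try:
        with path.open(encoding="utf-8") as f:
            return parse_version(f.readline())
    except OSError:
        return None


def in_sync(os_version: Optional[str], java_version: Optional[str]) -> bool:
    """The databases count as in sync only if both versions are known and equal."""
    return os_version is not None and os_version == java_version


if __name__ == "__main__":
    java_version = "2019c"  # hypothetical value; must be read from the JRE separately
    os_version = os_tz_version()
    print(f"OS: {os_version}, Java: {java_version}, in sync: {in_sync(os_version, java_version)}")
```

Comparing release names only shows that both sides were built from the same
IANA release; it does not prove that the converted rules are identical.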
> Time zone definitions of Hive/Spark and Impala differ for historical dates
> --------------------------------------------------------------------------
>
> Key: IMPALA-3933
> URL: https://issues.apache.org/jira/browse/IMPALA-3933
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: impala 2.3
> Reporter: Adriano Simone
> Priority: Minor
>
> TIMESTAMP skew with convert_legacy_hive_parquet_utc_timestamps=true
> Enabling --convert_legacy_hive_parquet_utc_timestamps=true seems to cause
> data skew (improper conversion) on read for dates earlier than 1900
> (not sure about the exact date).
> The following example was run on a server in the CEST time zone, so the
> offset is GMT+1 for dates before 1900 (I'm not sure; I haven't checked the
> exact date from which DST is computed) and GMT+2 when summer daylight
> saving time is in effect.
> create table itst (col1 int, myts timestamp) stored as parquet;
> From Impala:
> {code:java}
> insert into itst values (1,'2016-04-15 12:34:45');
> insert into itst values (2,'1949-04-15 12:34:45');
> insert into itst values (3,'1753-04-15 12:34:45');
> insert into itst values (4,'1752-04-15 12:34:45');
> {code}
> From Hive:
> {code:java}
> insert into itst values (5,'2016-04-15 12:34:45');
> insert into itst values (6,'1949-04-15 12:34:45');
> insert into itst values (7,'1753-04-15 12:34:45');
> insert into itst values (8,'1752-04-15 12:34:45');
> {code}
> From Impala:
> {code:java}
> select * from itst order by col1;
> {code}
> Result:
> {code:java}
> Query: select * from itst
> +------+---------------------+
> | col1 | myts                |
> +------+---------------------+
> | 1    | 2016-04-15 12:34:45 |
> | 2    | 1949-04-15 12:34:45 |
> | 3    | 1753-04-15 12:34:45 |
> | 4    | 1752-04-15 12:34:45 |
> | 5    | 2016-04-15 10:34:45 |
> | 6    | 1949-04-15 10:34:45 |
> | 7    | 1753-04-15 11:34:45 |
> | 8    | 1752-04-15 11:34:45 |
> +------+---------------------+
> {code}
> The timestamps look good; the DST differences are visible (Hive inserted
> the values in local time, but Impala shows them in UTC).
> From Impala, after setting the command-line argument
> "--convert_legacy_hive_parquet_utc_timestamps=true"
> {code:java}
> select * from itst order by col1;
> {code}
> The result in this case:
> {code:java}
> Query: select * from itst order by col1
> +------+---------------------+
> | col1 | myts                |
> +------+---------------------+
> | 1    | 2016-04-15 12:34:45 |
> | 2    | 1949-04-15 12:34:45 |
> | 3    | 1753-04-15 12:34:45 |
> | 4    | 1752-04-15 12:34:45 |
> | 5    | 2016-04-15 12:34:45 |
> | 6    | 1949-04-15 12:34:45 |
> | 7    | 1753-04-15 12:51:05 |
> | 8    | 1752-04-15 12:51:05 |
> +------+---------------------+
> {code}
> It seems that for rows 7 and 8 the value changes from 11:34:45 to 12:51:05
> instead of to the 12:34:45 that was inserted.
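
The size of the skew can be worked out from the two result tables in the
report. The sketch below takes row 7 and computes the offset Hive appears to
have applied on write, the offset Impala appears to have applied on read, and
the difference between them. The variable names are illustrative; the three
timestamps are copied from the report.

```python
from datetime import datetime, timedelta

# Row 7 as reported in the issue description.
written = datetime.fromisoformat("1753-04-15 12:34:45")     # inserted from Hive (local time)
stored_utc = datetime.fromisoformat("1753-04-15 11:34:45")  # shown by Impala without the flag
converted = datetime.fromisoformat("1753-04-15 12:51:05")   # shown by Impala with the flag

hive_offset = written - stored_utc      # offset applied on write: 1:00:00
impala_offset = converted - stored_utc  # offset applied on read:  1:16:20
skew = impala_offset - hive_offset      # resulting error:         0:16:20

print(f"Hive offset:   {hive_offset}")
print(f"Impala offset: {impala_offset}")
print(f"Skew:          {skew}")
```

An offset of 1:16:20 is not a whole number of hours, which would be consistent
with a local mean time (LMT) entry in the IANA data being applied to dates
that predate standardized time. That is an interpretation of the numbers, not
something the report confirms, and the server's exact zone is not stated.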