[
https://issues.apache.org/jira/browse/IMPALA-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964206#comment-16964206
]
Csaba Ringhofer edited comment on IMPALA-3933 at 10/31/19 4:54 PM:
-------------------------------------------------------------------
[[email protected]]
I checked dates with close to ~asf-master Impala and CDH/CDP Hives. We work
differently depending on the Hive version.
{code}
>From Hive:
create table tdate (d date) stored as parquet;
insert into table tdate values ("0001-01-01"), ("1400-01-01"), ("1500-01-01"),
("1800-01-01");
>From Impala:
invalidate metadata tdate;
select * from tdata;
When the data was inserted with CDP Hive, we return all values correctly:
+------------+
| d |
+------------+
| 0001-01-01 |
| 1400-01-01 |
| 1500-01-01 |
| 1800-01-01 |
+------------+
With CDH Hive, the very old dates are shifted, probably related to Julian vs
Proleptic Gregorian interpretation of old dates:
+------------+
| d |
+------------+
| NULL |
| 1400-01-09 |
| 1500-01-10 |
| 1800-01-01 |
+------------+
WARNINGS: Parquet file 'hdfs://localhost:20500/test-warehouse/tdate/000000_0'
column 'd' contains an out of range date. The valid date range is
0001-01-01..9999-12-31.
{code}
So dates are also problematic with CDH Hive, but it is a different problem than
the one described in the description of the Jira. The original issue is about
historical timezone rules, which do not affect dates, but very old dates are
still affected by different Julian/Gregorian handling. I think Hive switched to
Proleptic Gregorian in Hive 3.1. so it is similar to Impal from that point.
was (Author: csringhofer):
[[email protected]]
I checked dates with clode to ~asf-master Impala and CDH/CDP Hives. We work
differently depending on the Hive version.
{code}
>From Hive:
create table tdate (d date) stored as parquet;
insert into table tdate values ("0001-01-01"), ("1400-01-01"), ("1500-01-01"),
("1800-01-01");
>From Impala:
invalidate metadata tdate;
select * from tdata;
When the data was inserted with CDP Hive, we return all values correctly:
+------------+
| d |
+------------+
| 0001-01-01 |
| 1400-01-01 |
| 1500-01-01 |
| 1800-01-01 |
+------------+
With CDH Hive, the very old dates are shifted, probably related to Julian vs
Proleptic Gregorian interpretation of old dates:
+------------+
| d |
+------------+
| NULL |
| 1400-01-09 |
| 1500-01-10 |
| 1800-01-01 |
+------------+
WARNINGS: Parquet file 'hdfs://localhost:20500/test-warehouse/tdate/000000_0'
column 'd' contains an out of range date. The valid date range is
0001-01-01..9999-12-31.
{code}
So dates are also problematic with CDH Hive, but it is a different problem than
the one described in the description of the Jira. The original issue is about
historical timezone rules, which do not affect dates, but very old dates are
still affected by different Julian/Gregorian handling. I think Hive switched to
Proleptic Gregorian in Hive 3.1. so it is similar to Impal from that point.
> Time zone definitions of Hive/Spark and Impala differ for historical dates
> --------------------------------------------------------------------------
>
> Key: IMPALA-3933
> URL: https://issues.apache.org/jira/browse/IMPALA-3933
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: impala 2.3
> Reporter: Adriano Simone
> Priority: Minor
>
> How the TIMESTAMP skew with convert_legacy_hive_parquet_utc_timestamps=true
> Enabling --convert_legacy_hive_parquet_utc_timestamps=true seems to cause
> data skew (improper converting) upon the reading for dates earlier than 1900
> (not sure about the exact date).
> The following example was run on a server which is in CEST timezone, thus the
> time difference is GMT+1 for dates before 1900 (I'm not sure, I haven't
> checked the exact starting date of DST computation), and GMT+2 when summer
> daylight saving time was applied.
> create table itst (col1 int, myts timestamp) stored as parquet;
> From impala:
> {code:java}
> insert into itst values (1,'2016-04-15 12:34:45');
> insert into itst values (2,'1949-04-15 12:34:45');
> insert into itst values (3,'1753-04-15 12:34:45');
> insert into itst values (4,'1752-04-15 12:34:45');
> {code}
> from hive
> {code:java}
> insert into itst values (5,'2016-04-15 12:34:45');
> insert into itst values (6,'1949-04-15 12:34:45');
> insert into itst values (7,'1753-04-15 12:34:45');
> insert into itst values (8,'1752-04-15 12:34:45');
> {code}
> From impala
> {code:java}
> select * from itst order by col1;
> {code}
> Result:
> {code:java}
> Query: select * from itst
> +------+---------------------+
> | col1 | myts |
> +------+---------------------+
> | 1 | 2016-04-15 12:34:45 |
> | 2 | 1949-04-15 12:34:45 |
> | 3 | 1753-04-15 12:34:45 |
> | 4 | 1752-04-15 12:34:45 |
> | 5 | 2016-04-15 10:34:45 |
> | 6 | 1949-04-15 10:34:45 |
> | 7 | 1753-04-15 11:34:45 |
> | 8 | 1752-04-15 11:34:45 |
> +------+---------------------+
> {code}
> The timestamps are looking good, the DST differences can be seen (hive
> inserted it in local time, but impala shows it in UTC)
> From impala after setting the command line argument
> "--convert_legacy_hive_parquet_utc_timestamps=true"
> {code:java}
> select * from itst order by col1;
> {code}
> The result in this case:
> {code:java}
> Query: select * from itst order by col1
> +------+---------------------+
> | col1 | myts |
> +------+---------------------+
> | 1 | 2016-04-15 12:34:45 |
> | 2 | 1949-04-15 12:34:45 |
> | 3 | 1753-04-15 12:34:45 |
> | 4 | 1752-04-15 12:34:45 |
> | 5 | 2016-04-15 12:34:45 |
> | 6 | 1949-04-15 12:34:45 |
> | 7 | 1753-04-15 12:51:05 |
> | 8 | 1752-04-15 12:51:05 |
> +------+---------------------+
> {code}
> It seems that instead of 11:34:45 it is showing 12:51:05.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]