[ https://issues.apache.org/jira/browse/HIVE-21002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16746753#comment-16746753 ]

Owen O'Malley commented on HIVE-21002:
--------------------------------------

The desired semantics for SQL, and therefore Hive, are that timestamp is local 
(i.e. timestamp without time zone). Parquet had non-standard semantics for 
timestamp, so we need to minimize the pain to users while still making 
Hive's use of Parquet implement the standard semantics.

I suspect that most users read and write the data in the same time zone, 
which makes the problem less severe. I'd recommend adding an annotation to the 
Parquet file that records the writer's time zone (e.g. "America/Los_Angeles") 
and then using that information to readjust each timestamp. This would handle:

 * Reader: old, writer: new, time zone: same
 * Reader: old, writer: old, time zone: same
 * Reader: new, writer: new, time zone: same or different
 * Reader: new, writer: old, time zone: same
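A minimal Java sketch of the proposed readjustment, assuming the writer's zone ID can be recovered from file metadata; `readjust` is a hypothetical helper, not an actual Hive or Parquet API:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class TimestampAdjust {
    // Hypothetical helper: an old writer stored the timestamp as UTC-normalized
    // millis after converting from its local wall-clock time. Given the writer's
    // zone (read from the proposed file annotation), recover that wall-clock
    // value as a zone-free LocalDateTime.
    static LocalDateTime readjust(long utcMillis, ZoneId writerZone) {
        return Instant.ofEpochMilli(utcMillis)
                      .atZone(writerZone)
                      .toLocalDateTime();
    }

    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        // What a Hive 2.x writer in Los Angeles stored for 2018-01-01 00:00:00:
        // the instant 2018-01-01T08:00:00Z, as epoch millis.
        long stored = ZonedDateTime.of(2018, 1, 1, 0, 0, 0, 0, la)
                                   .toInstant().toEpochMilli();
        // A new reader, seeing the annotation, recovers the writer's wall clock.
        System.out.println(readjust(stored, la)); // prints 2018-01-01T00:00
    }
}
```

Note that this only recovers the writer's wall-clock value; rows written before the annotation exists would still need the reader to fall back on a default such as its own zone.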

Clearly we should push the reader and writer patch back to each branch of Hive 
that we care about. It would be good to use isAdjustedToUTC=true for the 
timestamp with local time zone type in the Hive 3 code.

> Backwards incompatible change: Hive 3.1 reads back Avro and Parquet 
> timestamps written by Hive 2.x incorrectly
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-21002
>                 URL: https://issues.apache.org/jira/browse/HIVE-21002
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 3.1.1
>            Reporter: Zoltan Ivanfi
>            Priority: Major
>
> Hive 3.1 reads back Avro and Parquet timestamps written by Hive 2.x 
> incorrectly. As an example session to demonstrate this problem, create a 
> dataset using Hive version 2.x in America/Los_Angeles:
> {code:sql}
> hive> create table ts_‹format› (ts timestamp) stored as ‹format›;
> hive> insert into ts_‹format› values ('2018-01-01 00:00:00.000');
> {code}
> Querying this table by issuing
> {code:sql}
> hive> select * from ts_‹format›;
> {code}
> from different time zones using different versions of Hive and different 
> storage formats gives the following results:
> ||‹format›||Writer time zone (in Hive 2.x)||Reader time zone||Result in Hive 2.x reader||Result in Hive 3.1 reader||
> |Avro and Parquet|America/Los_Angeles|America/Los_Angeles|2018-01-01 *00*:00:00.0|2018-01-01 *08*:00:00.0|
> |Avro and Parquet|America/Los_Angeles|Europe/Paris|2018-01-01 *09*:00:00.0|2018-01-01 *08*:00:00.0|
> |Textfile and ORC|America/Los_Angeles|America/Los_Angeles|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> |Textfile and ORC|America/Los_Angeles|Europe/Paris|2018-01-01 00:00:00.0|2018-01-01 00:00:00.0|
> *Hive 3.1 clearly gives different results than Hive 2.x for timestamps stored 
> in Avro and Parquet formats.* Apache ORC behaviour has not changed, because ORC 
> was modified to adjust timestamps to retain backwards compatibility. Textfile 
> behaviour has not changed, because its processing involves parsing and 
> formatting rather than proper serializing and deserializing, so textfile 
> timestamps inherently had LocalDateTime semantics even in Hive 2.x.
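The 8-hour shift in the table above can be reproduced with a short Java sketch of the two interpretations (a simplified model of the behaviour, not Hive's actual code path): the 2.x writer converts the wall-clock value to UTC millis via its local zone, while a 3.1-style reader treats the stored millis as an already-local value:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class SemanticsDemo {
    public static void main(String[] args) {
        ZoneId la = ZoneId.of("America/Los_Angeles");
        // Hive 2.x writer (Instant semantics): 2018-01-01 00:00:00 local is
        // normalized to UTC using the writer's zone before being stored.
        long stored = LocalDateTime.of(2018, 1, 1, 0, 0)
                                   .atZone(la).toInstant().toEpochMilli();
        // Hive 3.1 reader (LocalDateTime semantics): the stored millis are
        // read back as a zone-free wall-clock value, with no conversion back
        // through the writer's zone.
        LocalDateTime readBack =
                LocalDateTime.ofInstant(Instant.ofEpochMilli(stored), ZoneOffset.UTC);
        System.out.println(readBack); // prints 2018-01-01T08:00
    }
}
```

The value printed matches the *08* result in the table regardless of the reader's zone, which is also why the Hive 3.1 column is identical for America/Los_Angeles and Europe/Paris readers.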



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
