[jira] [Commented] (HIVE-25104) Backward incompatible timestamp serialization in Parquet for certain timezones

Stamatis Zampetakis (Jira) Fri, 14 May 2021 05:05:06 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344550#comment-17344550
 ]


Stamatis Zampetakis commented on HIVE-25104:
--------------------------------------------

Keeping both conversion rules raises another question: how do we decide which 
rule to apply when we read and write data?

Let's assume that we have the following properties:
* {{hive.parquet.timestamp.read.legacy.conversion.enabled}}, whether we should 
apply legacy conversion rules when we read the data from the file
* {{hive.parquet.timestamp.write.legacy.conversion.enabled}}, whether we should 
apply legacy conversion rules when we write the data to the file

The question is whether we should record in the file metadata which conversion 
was applied.

+Include conversion status in metadata+
If the status is present in the metadata then we don't need to rely on the 
value of the read property to apply the appropriate conversion rules when we 
read the data. Essentially, the value of the respective conf property is 
ignored; the read property becomes a no-op. External applications reading 
Parquet files written by Hive know exactly how they should interpret the data 
to display the expected results. 

The disadvantage of this approach is that once the file is written, a Hive user 
has no way to interpret the data using other conversion rules, so it is very 
important to set the write property correctly from the beginning. 

+Exclude conversion status from metadata+
If the status is not present in the metadata then reading and writing depend 
entirely on the values of the respective properties. This approach is more 
forgiving if somebody sets the write property incorrectly, but it makes it 
impossible for Hive and other tools to know which conversion rules were 
applied to the data when it was written to the files. Furthermore, being able 
to read and write timestamps differently in every query makes the overall 
approach very prone to errors.

Summing up, the best option is to record the conversion status in the file 
metadata on write. If this information is available on read then Hive will use 
it to choose the proper way to read the data. If it is not then Hive will read 
the data using the value of the read config property. 
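
The decision rule proposed above can be sketched as follows. This is only an 
illustration in Python, not Hive code: the metadata key name 
{{writer.legacy.conversion}} and both function names are hypothetical, since 
the comment does not say how the status would be stored.

```python
from typing import Dict, Mapping

# Hypothetical key name; the comment does not specify how the status is stored.
LEGACY_KEY = "writer.legacy.conversion"


def metadata_for_write(write_legacy_conf: bool) -> Dict[str, str]:
    """Record in the file metadata which conversion rule the writer applied."""
    return {LEGACY_KEY: "true" if write_legacy_conf else "false"}


def use_legacy_on_read(file_metadata: Mapping[str, str],
                       read_legacy_conf: bool) -> bool:
    """Prefer the status recorded in the file; fall back to the read property."""
    recorded = file_metadata.get(LEGACY_KEY)
    if recorded is not None:
        # The file says how it was written, so the read property is a no-op.
        return recorded == "true"
    # No status in the file (e.g. written before the change): use the config.
    return read_legacy_conf
```

With this rule a file written with the legacy conversion is read with the 
legacy conversion regardless of the read property, while files that carry no 
status behave exactly as the read property dictates.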


> Backward incompatible timestamp serialization in Parquet for certain timezones
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-25104
>                 URL: https://issues.apache.org/jira/browse/HIVE-25104
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.2
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are 
> performed and to some extent how timestamps are serialized and deserialized 
> in files (Parquet, Avro, Orc).
> In versions that include HIVE-12192 or HIVE-20007 the serialization in 
> Parquet files is not backwards compatible. In other words writing timestamps 
> with a version of Hive that includes HIVE-12192/HIVE-20007 and reading them 
> with another (not including the previous issues) may lead to different 
> results depending on the default timezone of the system.
> Consider the following scenario where the default system timezone is set to 
> US/Pacific.
> At apache/master commit 37f13b02dff94e310d77febd60f93d5a205254d3
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> INSERT INTO employee VALUES (1, '1880-01-01 00:00:00');
> INSERT INTO employee VALUES (2, '1884-01-01 00:00:00');
> INSERT INTO employee VALUES (3, '1990-01-01 00:00:00');
> SELECT * FROM employee;
> {code}
> |1|1880-01-01 00:00:00|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> At apache/branch-2.3 commit 324f9faf12d4b91a9359391810cb3312c004d356
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> SELECT * FROM employee;
> {code}
> |1|1879-12-31 23:52:58|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> The timestamp for {{eid=1}} in branch-2.3 is different from the one in master.
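
The 7 minute 2 second shift for {{eid=1}} is consistent with the writer and the 
reader using different UTC offsets for US/Pacific in 1880. The sketch below 
reproduces the arithmetic in Python under that assumption: the two offsets 
(local mean time of -7:52:58 and a fixed -8:00) are not stated in the issue, 
and {{write_then_read}} is a hypothetical helper, not Hive's conversion code.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets for US/Pacific in 1880 (not stated in the issue):
# local mean time for Los Angeles versus the fixed standard offset.
LMT = timezone(-timedelta(hours=7, minutes=52, seconds=58))
PST = timezone(-timedelta(hours=8))

FMT = "%Y-%m-%d %H:%M:%S"


def write_then_read(local: str, write_tz: timezone, read_tz: timezone) -> str:
    """Store a wall-clock time as a UTC instant, then render it with read_tz."""
    naive = datetime.strptime(local, FMT)
    instant = naive.replace(tzinfo=write_tz).astimezone(timezone.utc)
    return instant.astimezone(read_tz).strftime(FMT)


# Writer and reader disagree on the offset: the value shifts.
print(write_then_read("1880-01-01 00:00:00", LMT, PST))  # 1879-12-31 23:52:58
# Writer and reader agree: the value round-trips.
print(write_then_read("1880-01-01 00:00:00", PST, PST))  # 1880-01-01 00:00:00
```

Under the same assumption, the rows for 1884 and 1990 are unaffected because 
both sides would apply the same offset for those dates.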



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
