[
https://issues.apache.org/jira/browse/HIVE-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17344550#comment-17344550
]
Stamatis Zampetakis commented on HIVE-25104:
--------------------------------------------
Keeping both conversion rules brings up another question that is how we decide
to read/write data.
Let's assume that we have the following properties:
* {{hive.parquet.timestamp.read.legacy.conversion.enabled}}, whether we should
apply legacy conversion rules when we read the data from the file
* {{hive.parquet.timestamp.write.legacy.conversion.enabled}}, whether we should
apply legacy conversion rules when write the data to the file
The question is if we should exploit the file metadata to include information
about the conversion or not.
+Include conversion status in metadata+
If the status is present in the metadata then we don't need to rely on the
value of the read property to apply the appropriate conversion rules when we
read the data. Essentially, the value of the respective conf property is
ignored the read property becomes a noop. External applications reading parquet
files written by Hive know exactly how they should interpret the data to
display the expected results.
The disadvantage of this approach is that once the file is written a Hive user
has no way to interpret the data using other conversion rules so it is very
important to set the write property correctly in the beginning.
+Exclude conversion status from metadata+
If the status is not present in the metadata then reading and writing depends
entirely on the values of the respective properties. This approach is more
forgiving if somebody sets the write property incorrectly but it makes it
impossible for Hive and other tools to know the conversion rules that were
applied in the data when they were written in the files. Furthermore, reading
and writing timestamps differently in a every query make s the overall approach
very prone to errors.
Summing up the best option is to record the conversion status in the file
metadata on write. If this information is available on read then Hive will use
it to choose the proper way to read the data. If it is not then Hive will read
the data using the value of read config property.
> Backward incompatible timestamp serialization in Parquet for certain timezones
> ------------------------------------------------------------------------------
>
> Key: HIVE-25104
> URL: https://issues.apache.org/jira/browse/HIVE-25104
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 3.1.2
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
>
> HIVE-12192, HIVE-20007 changed the way that timestamp computations are
> performed and to some extend how timestamps are serialized and deserialized
> in files (Parquet, Avro, Orc).
> In versions that include HIVE-12192 or HIVE-20007 the serialization in
> Parquet files is not backwards compatible. In other words writing timestamps
> with a version of Hive that includes HIVE-12192/HIVE-20007 and reading them
> with another (not including the previous issues) may lead to different
> results depending on the default timezone of the system.
> Consider the following scenario where the default system timezone is set to
> US/Pacific.
> At apache/master commit 37f13b02dff94e310d77febd60f93d5a205254d3
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
> LOCATION '/tmp/hiveexttbl/employee';
> INSERT INTO employee VALUES (1, '1880-01-01 00:00:00');
> INSERT INTO employee VALUES (2, '1884-01-01 00:00:00');
> INSERT INTO employee VALUES (3, '1990-01-01 00:00:00');
> SELECT * FROM employee;
> {code}
> |1|1880-01-01 00:00:00|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> At apache/branch-2.3 commit 324f9faf12d4b91a9359391810cb3312c004d356
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
> LOCATION '/tmp/hiveexttbl/employee';
> SELECT * FROM employee;
> {code}
> |1|1879-12-31 23:52:58|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> The timestamp for {{eid=1}} in branch-2.3 is different from the one in master.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)