[ 
https://issues.apache.org/jira/browse/HIVE-25104?focusedWorklogId=601399&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-601399
 ]

ASF GitHub Bot logged work on HIVE-25104:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 24/May/21 21:03
            Start Date: 24/May/21 21:03
    Worklog Time Spent: 10m 
      Work Description: jcamachor commented on a change in pull request #2282:
URL: https://github.com/apache/hive/pull/2282#discussion_r638140142



##########
File path: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
##########
@@ -2197,9 +2197,14 @@ private static void populateLlapDaemonVarsSet(Set<String> llapDaemonVarsSetLocal
     HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT("hive.parquet.date.proleptic.gregorian.default", false,
       "This value controls whether date type in Parquet files was written using the hybrid or proleptic\n" +
       "calendar. Hybrid is the default."),
-    HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED("hive.parquet.timestamp.legacy.conversion.enabled", true,

Review comment:
       Although this property was marked as being used for debugging purposes only, if we are deprecating it, should we try to fail when it is set explicitly? Otherwise, we may be silently changing results for users.
   
   If capturing the set command for a specific property proves complicated, another option could be to use the same name for one of the properties (even if then they would not contain `.*read.*`/`.*write.*`).
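
   A fail-fast check could look like the following minimal sketch. The property name is the real one from HiveConf, but the helper, its class, and the plain Map standing in for the Hadoop Configuration are hypothetical illustration, not Hive's actual validation path:

```java
import java.util.Map;

public class DeprecatedPropCheck {
  // Real property name from HiveConf; the rejection logic below is a sketch.
  static final String LEGACY_CONVERSION =
      "hive.parquet.timestamp.legacy.conversion.enabled";

  /** Throws if the deprecated property was set explicitly by the user. */
  static void rejectIfExplicitlySet(Map<String, String> userOverrides) {
    if (userOverrides.containsKey(LEGACY_CONVERSION)) {
      throw new IllegalArgumentException(
          "Property " + LEGACY_CONVERSION + " is deprecated; remove it "
          + "instead of relying on its value.");
    }
  }

  public static void main(String[] args) {
    rejectIfExplicitlySet(Map.of()); // default config: passes silently
    try {
      rejectIfExplicitlySet(Map.of(LEGACY_CONVERSION, "true"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

   The point of rejecting rather than ignoring is exactly the one raised above: a silently ignored override changes query results without any signal to the user.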

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -523,10 +532,30 @@ private static MessageType getRequestedPrunedSchema(
           configuration, HiveConf.ConfVars.HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT)));
     }
 
-    String legacyConversion = ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED.varname;
-    if (!metadata.containsKey(legacyConversion)) {
-      metadata.put(legacyConversion, String.valueOf(HiveConf.getBoolVar(
-          configuration, HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED)));
+    if (!metadata.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+      final String legacyConversion;
+      if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
+        // If there is meta about the legacy conversion then the file should be read in the same way it was written.
+        legacyConversion = keyValueMetaData.get(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY);
+      } else if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_TIMEZONE)) {
+        // If there is no meta about the legacy conversion but there is meta about the timezone then we can infer the
+        // file was written with the new rules.
+        legacyConversion = "false";
+      } else {

Review comment:
       @zabetak, I guess you mean this block?
   `else if(keyValueMetaData.containsKey(DataWritableWriteSupport.WRITER_TIMEZONE))`
   
   The question is whether the default value for the config property is going to give the desired results for those users. I believe that's not the case, i.e., if the value is not there, we assume it was written applying conversions with the old APIs. Many users would fall in that bucket (3.1.2, 3.2.0), so maybe it makes sense to have this clause to preserve behavior, as you have done.
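
   The three-way decision being discussed can be isolated as a pure function over the file's key/value metadata. The key names mirror the DataWritableWriteSupport constants referenced in the diff, but the literal string values and the standalone class below are assumptions made for this sketch:

```java
import java.util.Map;

public class LegacyConversionInference {
  // Names mirror DataWritableWriteSupport constants; the exact string
  // values used here are assumptions for this sketch.
  static final String WRITER_ZONE_CONVERSION_LEGACY = "writer.zone.conversion.legacy";
  static final String WRITER_TIMEZONE = "writer.time.zone";

  /**
   * Decide whether legacy timezone conversion applies when reading a file.
   * defaultLegacy stands in for the configured HiveConf fallback.
   */
  static boolean inferLegacyConversion(Map<String, String> fileMeta, boolean defaultLegacy) {
    if (fileMeta.containsKey(WRITER_ZONE_CONVERSION_LEGACY)) {
      // The writer recorded how it converted; read the same way.
      return Boolean.parseBoolean(fileMeta.get(WRITER_ZONE_CONVERSION_LEGACY));
    }
    if (fileMeta.containsKey(WRITER_TIMEZONE)) {
      // A writer timezone without the legacy marker implies the new rules.
      return false;
    }
    // No metadata at all: old writers (e.g. 3.1.2, 3.2.0) leave neither key,
    // so fall back to the configured default to preserve their behavior.
    return defaultLegacy;
  }

  public static void main(String[] args) {
    System.out.println(inferLegacyConversion(Map.of(WRITER_ZONE_CONVERSION_LEGACY, "true"), false)); // true
    System.out.println(inferLegacyConversion(Map.of(WRITER_TIMEZONE, "UTC"), true));                 // false
    System.out.println(inferLegacyConversion(Map.of(), true));                                       // true
  }
}
```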

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
##########
@@ -536,7 +542,8 @@ public void write(Object value) {
         Long int64value = ParquetTimestampUtils.getInt64(ts, timeUnit);
         recordConsumer.addLong(int64value);

Review comment:
       Isn't timestamp in INT64 Parquet already stored as UTC? I think we tried to keep all these conversions away from the new type.
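
   For context, the INT64 Parquet timestamp logical type carries a count of time units since the Unix epoch, so the stored value is timezone-free by construction. A minimal java.time sketch of that encoding (this is illustrative only; it does not reproduce `ParquetTimestampUtils.getInt64`):

```java
import java.time.Instant;

public class Int64TimestampSketch {
  /** Encode an instant as microseconds since the Unix epoch (UTC by construction). */
  static long toEpochMicros(Instant ts) {
    return Math.addExact(
        Math.multiplyExact(ts.getEpochSecond(), 1_000_000L),
        ts.getNano() / 1_000L);
  }

  public static void main(String[] args) {
    Instant t = Instant.parse("1990-01-01T00:00:00Z");
    System.out.println(toEpochMicros(t)); // 631152000000000
  }
}
```

   Since the value is epoch-relative, any zone conversion applied on write would have to be undone symmetrically on read, which is presumably why the new type avoided such conversions.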

##########
File path: ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
##########
@@ -304,6 +305,14 @@ public static Boolean getWriterDateProleptic(Map<String, String> metadata) {
     return null;
   }
 
+  public static Boolean getWriterLegacyConversion(Map<String, String> metadata) {

Review comment:
       `getWriterLegacyConversion` -> `getWriterTimeZoneLegacyConversion`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 601399)
    Time Spent: 1h  (was: 50m)

> Backward incompatible timestamp serialization in Parquet for certain timezones
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-25104
>                 URL: https://issues.apache.org/jira/browse/HIVE-25104
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 3.1.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> HIVE-12192 and HIVE-20007 changed the way that timestamp computations are 
> performed and, to some extent, how timestamps are serialized and deserialized 
> in files (Parquet, Avro, Orc).
> In versions that include HIVE-12192 or HIVE-20007, the serialization in 
> Parquet files is not backwards compatible. In other words, writing timestamps 
> with a version of Hive that includes HIVE-12192/HIVE-20007 and reading them 
> with another version (one not including those issues) may lead to different 
> results, depending on the default timezone of the system.
> Consider the following scenario where the default system timezone is set to 
> US/Pacific.
> At apache/master commit 37f13b02dff94e310d77febd60f93d5a205254d3
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> INSERT INTO employee VALUES (1, '1880-01-01 00:00:00');
> INSERT INTO employee VALUES (2, '1884-01-01 00:00:00');
> INSERT INTO employee VALUES (3, '1990-01-01 00:00:00');
> SELECT * FROM employee;
> {code}
> |1|1880-01-01 00:00:00|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> At apache/branch-2.3 commit 324f9faf12d4b91a9359391810cb3312c004d356
> {code:sql}
> CREATE EXTERNAL TABLE employee(eid INT,birth timestamp) STORED AS PARQUET
>  LOCATION '/tmp/hiveexttbl/employee';
> SELECT * FROM employee;
> {code}
> |1|1879-12-31 23:52:58|
> |2|1884-01-01 00:00:00|
> |3|1990-01-01 00:00:00|
> The timestamp for {{eid=1}} in branch-2.3 is different from the one in master.
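
The 7-minute-2-second shift for eid=1 matches the local mean time (LMT) offset of US/Pacific before zone standardization in 1883: java.time resolves 1880 to UTC-07:52:58 rather than the standard UTC-08:00. A small java.time check of those offsets (illustrative only; this is not Hive code, and which Hive version applies which offset is exactly the question above):

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class PacificLmtOffset {
  static ZoneOffset offsetAt(String zone, LocalDateTime ldt) {
    return ZoneId.of(zone).getRules().getOffset(ldt);
  }

  public static void main(String[] args) {
    LocalDateTime early = LocalDateTime.of(1880, 1, 1, 0, 0);   // before 1883 zone standardization
    LocalDateTime modern = LocalDateTime.of(1990, 1, 1, 0, 0);
    System.out.println(offsetAt("America/Los_Angeles", early));  // -07:52:58 (LMT)
    System.out.println(offsetAt("America/Los_Angeles", modern)); // -08:00 (PST)
  }
}
```

The difference between the two offsets is exactly 7m02s, i.e. the gap between `1880-01-01 00:00:00` on master and `1879-12-31 23:52:58` on branch-2.3; the 1884 and 1990 rows agree because they fall after the standardization boundary.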



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
