-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41821/#review115731
-----------------------------------------------------------




ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java 
(line 139)
<https://reviews.apache.org/r/41821/#comment176776>

    One of the test cases listed in the scope doc is a zone ID that isn't 
valid, like "Antarctica/South_Pole". In that case the table should not be 
readable because the correct zone offset can't be determined.
    
    `TimeZone.getTimeZone(String)` returns GMT when the zone ID isn't 
recognized, so I'm concerned that this will do the wrong thing and use GMT in 
some cases where the zone should be rejected.
    
    See 
https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getTimeZone(java.lang.String)
    You might need to make a set of available IDs using 
https://docs.oracle.com/javase/7/docs/api/java/util/TimeZone.html#getAvailableIDs()
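
    To make the failure explicit, the ID could be validated against the set of 
known IDs before it is resolved. A minimal sketch (the class and method names 
here are mine, not part of the patch):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

// Illustrative sketch: validate a zone ID against the JVM's known IDs before
// resolving it, since TimeZone.getTimeZone silently falls back to GMT for
// unrecognized IDs.
public class ZoneCheck {
  private static final Set<String> KNOWN_IDS =
      new HashSet<>(Arrays.asList(TimeZone.getAvailableIDs()));

  static TimeZone parse(String id) {
    if (!KNOWN_IDS.contains(id)) {
      // Reject instead of silently becoming GMT.
      throw new IllegalArgumentException("Unknown write zone: " + id);
    }
    return TimeZone.getTimeZone(id);
  }

  public static void main(String[] args) {
    // Demonstrates the silent fallback the comment is worried about:
    System.out.println(TimeZone.getTimeZone("Not/A_Zone").getID()); // prints GMT
    try {
      parse("Not/A_Zone");
    } catch (IllegalArgumentException expected) {
      System.out.println("rejected: " + expected.getMessage());
    }
  }
}
```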



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java 
(line 168)
<https://reviews.apache.org/r/41821/#comment176778>

    It looks like this removes support for 
`HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION.varname`. I think we 
should maintain support for that configuration setting since customers may 
already be using it and we don't want to change behavior. If it is set to true, 
it should cause the calendar to use UTC (so no adjustment is made). However, 
the new table property should override the Hive setting, so I think the check 
should go in the `Strings.isNullOrEmpty` case.
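
    A sketch of the precedence I have in mind (the names here are illustrative, 
not the actual Hive identifiers): the table property wins, and only when it is 
absent does the legacy skip-conversion flag apply.

```java
import java.util.Calendar;
import java.util.TimeZone;

// Illustrative precedence: table property first, legacy flag as fallback.
public class CalendarChoice {
  static Calendar chooseCalendar(String tableZoneProp, boolean legacySkipConversion) {
    if (tableZoneProp == null || tableZoneProp.isEmpty()) {
      if (legacySkipConversion) {
        // Legacy behavior: skip conversion == treat values as UTC (no adjustment).
        return Calendar.getInstance(TimeZone.getTimeZone("UTC"));
      }
      return Calendar.getInstance(); // old default: local zone
    }
    // Table property overrides the Hive setting when both are present.
    return Calendar.getInstance(TimeZone.getTimeZone(tableZoneProp));
  }

  public static void main(String[] args) {
    System.out.println(chooseCalendar(null, true).getTimeZone().getID());  // UTC
    System.out.println(chooseCalendar("PST", true).getTimeZone().getID()); // PST
  }
}
```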



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
 (line 57)
<https://reviews.apache.org/r/41821/#comment176779>

    Nit: We use UTC elsewhere. I believe the two are identical for our 
purposes, but to reduce confusion I would recommend changing GMT to UTC.



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
 (line 211)
<https://reviews.apache.org/r/41821/#comment176783>

    The file metadata is also passed in [via 
InitContext](https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java#L173).
 If the file has the write zone property set, then it should be validated 
against the configured write zone (if set). When present, the file's value 
should override the table value and there should be a warning if the table 
value doesn't match the file value. That covers cases where files are moved 
from one table to another.
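
    Roughly the resolution I'd expect (a sketch with made-up names, not the 
patch's code): the file's recorded zone wins, with a warning on mismatch.

```java
// Illustrative sketch: prefer the zone recorded in the file's own metadata
// over the table property, warning when the two disagree.
public class WriteZoneResolution {
  static String resolveWriteZone(String fileZone, String tableZone) {
    if (fileZone == null) {
      return tableZone; // file carries no zone; fall back to the table property
    }
    if (tableZone != null && !tableZone.equals(fileZone)) {
      // Covers files moved between tables with different settings.
      System.err.println("WARN: file write zone " + fileZone
          + " does not match table write zone " + tableZone);
    }
    return fileZone;
  }

  public static void main(String[] args) {
    System.out.println(resolveWriteZone("PST", "UTC")); // file value wins, warning emitted
    System.out.println(resolveWriteZone(null, "UTC")); // table value used
  }
}
```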



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
 (line 161)
<https://reviews.apache.org/r/41821/#comment176785>

    Shouldn't this use UTC, which will apply an offset of 0?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
 (line 165)
<https://reviews.apache.org/r/41821/#comment176787>

    When will the zone property be set before this method? Is this how the 
table property is passed in?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java 
(line 21)
<https://reviews.apache.org/r/41821/#comment176788>

    I think calling this the default write zone is not quite correct. It is the 
default zone when the Hive config property is set, but the current zone is the 
default otherwise. What about calling this the PARQUET_INT96_NO_ADJUSTMENT_ZONE?



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java 
(line 53)
<https://reviews.apache.org/r/41821/#comment176792>

    This is a little odd. Normally, I would consider GMT/UTC to be the calendar 
that applies no adjustment. But what is happening in this code is that the UTC 
millisecond value is being constructed using values in that zone and then a new 
timestamp is created from it that is in the local zone. So local->local 
produces no adjustment, while GMT->local does.
    
    I was thinking that GMT would always be used to create the milliseconds, 
then the calendar would be used to adjust the value by some amount. UTC by 0, 
EST by -5, and PST by -8.
    
    I think it may be easier to have a method that does a single conversion to 
UTC in milliseconds. Then, adjust the value using a static offset like the one 
from `TimeZone.getOffset(utc_ms)`. That would result in a bit less work done 
here and I think would make the adjustment easier to reason about.
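
    That single-conversion approach would look something like this (a sketch 
under my reading of the suggestion; the names are mine):

```java
import java.util.TimeZone;

// Illustrative sketch of the suggested approach: compute the UTC millisecond
// value once, then shift it by the zone's offset at that instant.
public class OffsetAdjust {
  static long adjustToZone(long utcMillis, TimeZone zone) {
    // getOffset returns the zone's offset from UTC in milliseconds at that
    // instant (0 for UTC, -5h for EST, -8h for PST), so UTC is a no-op.
    return utcMillis + zone.getOffset(utcMillis);
  }

  public static void main(String[] args) {
    long utc = 0L; // 1970-01-01T00:00:00Z
    System.out.println(adjustToZone(utc, TimeZone.getTimeZone("UTC"))); // 0
    System.out.println(adjustToZone(utc, TimeZone.getTimeZone("EST"))); // -18000000
  }
}
```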



ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
 (line 35)
<https://reviews.apache.org/r/41821/#comment176981>

    What is the purpose of this property? Is this just a way to pass the time 
zone? Why not use the property used elsewhere?



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
(line 227)
<https://reviews.apache.org/r/41821/#comment176983>

    It looks like this tests that the value written is the result of 
`getNanoTime(Timestamp,Calendar).toBinary()`, but doesn't test what that value 
is or should be. That is probably the intent, but I wanted to make sure.



ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
 (line 214)
<https://reviews.apache.org/r/41821/#comment176985>

    This is a great test for round trip, but I think we should also have a test 
with a set of corrupt values from a file written with 5.5 in 
America/Los_Angeles (since that's convenient).


- Ryan Blue


On Jan. 13, 2016, 12:36 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41821/
> -----------------------------------------------------------
> 
> (Updated Jan. 13, 2016, 12:36 p.m.)
> 
> 
> Review request for hive, Ryan Blue, Mohammad Islam, Reuben Kuhnert, and 
> Szehon Ho.
> 
> 
> Bugs: HIVE-12767
>     https://issues.apache.org/jira/browse/HIVE-12767
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The following exit criteria are addressed in this patch:
> 
> * Hive will read Parquet MR int96 timestamp data and adjust values using a 
> time zone from a table property, if set, or using the local time zone if it 
> is absent. No adjustment will be applied to data written by Impala.
> 
> * Hive will write Parquet int96 timestamps using a time zone adjustment from 
> the same table property, if set, or using the local time zone if it is 
> absent. This keeps the data in the table consistent.
> 
> * New tables created by Hive will set the table property to UTC if the global 
> option to set the property for new tables is enabled.
>   * Tables created using CREATE TABLE and CREATE TABLE LIKE FILE will not set 
> the property unless the global setting to do so is enabled.
>   * Tables created using CREATE TABLE LIKE <OTHER TABLE> will copy the 
> property of the table that is copied.
> 
> To set the timezone table property, use this:
>   create table tbl1 (ts timestamp) stored as parquet tblproperties 
> ('parquet.mr.int96.write.zone'='PST');
>   
> To set UTC as default timezone table property on new tables created, use 
> this: 
>   set parquet.mr.int96.enable.utc.write.zone=true;
>   create table tbl2 (ts timestamp) stored as parquet;
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 
> 1bcdc5f49e1a4a0f357842e88cf5fd359685b5ef 
>   data/files/impala_int96_timestamp.parq PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 
> deec8bba45c130c5dfdc482522c0825a71af9d2c 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetOutputFormat.java
>  bfb48a987ce89a373f3da63c9162546c6eda43a9 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ETypeConverter.java 
> ec0dd818f688ab92feb46be4fb6040ede5ac756a 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java
>  53f3b72b790d87a75a7cd1d77d8f011c29c41188 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/ParquetRecordReaderWrapper.java
>  74a1a82047613189678716f765bfaa9ac39b7618 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetTableUtils.java 
> PRE-CREATION 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/timestamp/NanoTimeUtils.java 
> aace48ee7d145d199163286d21e4ee7694140d6f 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriteSupport.java
>  f4621e5dbb81e8d58c4572c901ec9d1a7ca8c012 
>   
> ql/src/java/org/apache/hadoop/hive/ql/io/parquet/write/DataWritableWriter.java
>  69272dc41dbc5fe29ab4c98e730b591c28f3a297 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java 
> 70491390ba2b90f32ef9963be7b19e57672241f3 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>  PRE-CREATION 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>  ec6def5b9ac5f12e6a7cb24c4f4998a6ca6b4a8e 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>  PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_int96_timestamp.q.out 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41821/diff/
> 
> 
> Testing
> -------
> 
> Added unit and q-tests:
>   ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestDataWritableWriter.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/convert/TestETypeConverter.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/serde/TestParquetTimestampUtils.java
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/timestamp/TestParquetTimestampConverter.java
>   ql/src/test/queries/clientpositive/parquet_int96_timestamp.q
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>
