Hello Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/22293

to look at the new patch set (#5).

Change subject: IMPALA-13627: Handle legacy Hive timezone conversion
......................................................................

IMPALA-13627: Handle legacy Hive timezone conversion

After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.

Some timestamps convert differently between the two methods, for example
those in Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in
HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
- Hive prior to 3.1 used what’s now called “legacy conversion” using
  SimpleDateFormat.
- Hive 3.1.3 used a new Java API that’s based on tzdata.
- Hive 4 supports both, and added new file metadata to identify which
  one was used.

Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
- if writer.zone.conversion.legacy is present, use it to determine
  whether to use a legacy conversion method compatible with Hive's
  legacy behavior, or convert using tzdata.
- if writer.zone.conversion.legacy is not present but writer.time.zone
  is, we can infer it was written by Hive 3.1.3 using new APIs.
- otherwise it was likely written by an earlier Hive version.
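
The three cases above can be sketched as a small decision function. This
is only an illustration, not the code in this patch: the class, enum and
method names are hypothetical, and only the metadata keys, the
created_by prefix and the query option come from the description above.

```java
import java.util.Map;

// Hypothetical sketch; names are illustrative and not taken from the Impala sources.
class HiveConversionChooser {
    enum Conversion { NOT_HIVE, TZDATA, LEGACY }

    // createdBy: the Parquet created_by string.
    // kv: the file's key/value metadata.
    // useLegacyOption: value of use_legacy_hive_timestamp_conversion.
    static Conversion choose(String createdBy, Map<String, String> kv,
                             boolean useLegacyOption) {
        // Only files written by Hive (parquet-mr) are considered.
        if (createdBy == null || !createdBy.startsWith("parquet-mr")) {
            return Conversion.NOT_HIVE;
        }
        // Hive 4: the file states which method was used.
        String legacy = kv.get("writer.zone.conversion.legacy");
        if (legacy != null) {
            return Boolean.parseBoolean(legacy) ? Conversion.LEGACY : Conversion.TZDATA;
        }
        // Hive 3.1.3: writer.time.zone is present without the legacy flag.
        if (kv.containsKey("writer.time.zone")) {
            return Conversion.TZDATA;
        }
        // Likely an earlier Hive version: fall back to the query option.
        return useLegacyOption ? Conversion.LEGACY : Conversion.TZDATA;
    }

    public static void main(String[] args) {
        String hive = "parquet-mr version 1.10.0";
        System.out.println(choose(hive, Map.of("writer.zone.conversion.legacy", "true"), false));
        System.out.println(choose(hive, Map.of("writer.time.zone", "US/Pacific"), true));
        System.out.println(choose(hive, Map.of(), true));
    }
}
```

With the option left at its default of false, the last branch keeps the
existing tzdata-based behavior for files from older Hive versions.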

Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use when an old Hive file is
detected. Defaults to false to minimize changes in Impala's behavior and
because going through JNI has a significant performance impact even when
the results would not differ. Hive defaults to true for its equivalent
setting: hive.parquet.timestamp.legacy.conversion.enabled.

Hive legacy-compatible conversion uses a Java method that does the
following:

  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);
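
For comparison, a self-contained sketch that runs the legacy
SimpleDateFormat path next to a java.time conversion. It assumes
java.time is a fair stand-in for the tzdata-based method; the class and
method names are hypothetical and not part of this patch.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical sketch; not the JNI helper added by this change.
class LegacyVsTzdata {
    static final String PATTERN = "yyyy-MM-dd HH:mm:ss";

    // Legacy path: parse in the given zone, then format the instant in UTC.
    static String legacyToUtc(String dateTime, String zone) {
        SimpleDateFormat formatter = new SimpleDateFormat(PATTERN);
        formatter.setTimeZone(TimeZone.getTimeZone(zone));
        Date date;
        try {
            date = formatter.parse(dateTime);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + dateTime, e);
        }
        formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
        return formatter.format(date);
    }

    // Assumed stand-in for the newer method: the same conversion via java.time.
    static String javaTimeToUtc(String dateTime, String zone) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(PATTERN);
        LocalDateTime local = LocalDateTime.parse(dateTime, fmt);
        return local.atZone(ZoneId.of(zone))
                    .withZoneSameInstant(ZoneOffset.UTC)
                    .format(fmt);
    }

    public static void main(String[] args) {
        // A pre-1982 timestamp in one of the zones mentioned above; print both
        // results so any difference on the local JVM is visible.
        String ts = "1900-01-01 00:00:00";
        String zone = "Asia/Kuala_Lumpur";
        System.out.println("legacy:    " + legacyToUtc(ts, zone));
        System.out.println("java.time: " + javaTimeToUtc(ts, zone));
    }
}
```

For recent timestamps the two paths should agree, so a difference in the
printed values points at the historical offsets discussed above.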

Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.

Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
---
M be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/frontend.cc
M be/src/service/frontend.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
A testdata/data/employee_hive_3_1_3_us_pacific.parquet
A testdata/data/hive_kuala_lumpur_legacy.parquet
A testdata/data/tbl_parq1/000000_0
A testdata/data/tbl_parq1/000000_1
A testdata/data/tbl_parq1/000000_2
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-313.test
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-3m.test
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-4.test
A tests/query_test/test_hive_timestamp_conversion.py
24 files changed, 435 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/22293/5
--
To view, visit http://gerrit.cloudera.org:8080/22293
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Gerrit-Change-Number: 22293
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
