Hello Riza Suminto, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/22293

to look at the new patch set (#13).

Change subject: IMPALA-13627: Handle legacy Hive timezone conversion
......................................................................

IMPALA-13627: Handle legacy Hive timezone conversion

After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.

Some dates convert differently between the two methods, such as dates
in Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in
HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
- Hive prior to 3.1 used what’s now called “legacy conversion” using
  SimpleDateFormat.
- Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
  tzdata and added metadata to identify the timezone.
- Hive 4 supports both, and added new file metadata to identify which
  one was used.

Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
- if writer.zone.conversion.legacy is present (Hive 4), use it to
  determine whether to use a legacy conversion method compatible with
  Hive's legacy behavior, or convert using tzdata.
- if writer.zone.conversion.legacy is not present but writer.time.zone
  is, we can infer it was written by Hive 3.1.2+ using new APIs.
- otherwise it was likely written by an earlier Hive version.
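
The three rules above can be sketched roughly as follows. This is an
illustrative Java sketch, not the code in this change (the real logic
lives in the C++ Parquet scanner). The class, enum and method names are
hypothetical. It assumes the caller has already checked
created_by=parquet-mr, that the file's key/value metadata is available
as a string map, and that the legacy flag is stored as "true"/"false".

```java
import java.util.Map;

public class HiveConversionInference {
    // Hypothetical names for the two conversion methods described above.
    public enum Conversion { LEGACY, TZDATA }

    // kv: Parquet key/value file metadata (assumed to be a plain string map).
    // useLegacyDefault: stands in for use_legacy_hive_timestamp_conversion.
    public static Conversion infer(Map<String, String> kv, boolean useLegacyDefault) {
        String legacy = kv.get("writer.zone.conversion.legacy");
        if (legacy != null) {
            // Hive 4: the file states which method was used.
            return Boolean.parseBoolean(legacy) ? Conversion.LEGACY : Conversion.TZDATA;
        }
        if (kv.containsKey("writer.time.zone")) {
            // Hive 3.1.2+: written with the newer tzdata-based APIs.
            return Conversion.TZDATA;
        }
        // Likely an older Hive file: fall back to the configured option.
        return useLegacyDefault ? Conversion.LEGACY : Conversion.TZDATA;
    }

    public static void main(String[] args) {
        System.out.println(infer(Map.of("writer.zone.conversion.legacy", "true"), false));
        System.out.println(infer(Map.of("writer.time.zone", "US/Pacific"), true));
        System.out.println(infer(Map.of(), false));
    }
}
```

Only the last case consults the option, so files that carry explicit
metadata are handled the same way regardless of how it is set.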

Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use when an old Hive file is
detected. Defaults to false to minimize changes in Impala's behavior and
because going through JNI is ~50x slower even when the results would not
differ; Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.

Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing

  DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
  java.util.Date date = formatter.parse(date_time_string);
  formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
  return formatter.format(date);
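
For experimenting with the snippet above, a self-contained version
might look like the sketch below. The class and method names are
hypothetical. Using java.time as the stand-in for the newer
tzdata-based path is an assumption made for illustration and is not
taken from Hive's code. Whether the two paths disagree for a given
historical timestamp depends on the JDK in use, so the sketch prints
both results rather than claiming a particular difference.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

public class LegacyVsTzdataSketch {
    // Legacy-style path: parse in the writer's zone, format in UTC.
    public static String legacyToUtc(String local, String zone) {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        formatter.setTimeZone(TimeZone.getTimeZone(zone));
        try {
            Date date = formatter.parse(local);
            formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
            return formatter.format(date);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + local, e);
        }
    }

    // Assumed stand-in for the newer path: java.time with the JDK's tz rules.
    public static String javaTimeToUtc(String local, String zone) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");
        return LocalDateTime.parse(local, f)
            .atZone(ZoneId.of(zone))
            .withZoneSameInstant(ZoneOffset.UTC)
            .format(f);
    }

    public static void main(String[] args) {
        String zone = "Asia/Kuala_Lumpur";
        for (String ts : new String[] {"2020-01-01 00:00:00", "1970-01-01 00:00:00"}) {
            System.out.println(ts + " legacy=" + legacyToUtc(ts, zone)
                + " java.time=" + javaTimeToUtc(ts, zone));
        }
    }
}
```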

Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.

Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
---
M be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/service/frontend.cc
M be/src/service/frontend.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M docs/topics/impala_timestamp.xml
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
A testdata/data/employee_hive_3_1_3_us_pacific.parquet
A testdata/data/hive_kuala_lumpur_legacy.parquet
A testdata/data/tbl_parq1/000000_0
A testdata/data/tbl_parq1/000000_1
A testdata/data/tbl_parq1/000000_2
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-313.test
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-3m.test
A testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-4.test
A tests/query_test/test_hive_timestamp_conversion.py
25 files changed, 502 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/22293/13
--
To view, visit http://gerrit.cloudera.org:8080/22293
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Gerrit-Change-Number: 22293
Gerrit-PatchSet: 13
Gerrit-Owner: Michael Smith <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
