Hello Riza Suminto, Csaba Ringhofer, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/22293
to look at the new patch set (#15).
Change subject: IMPALA-13627: Handle legacy Hive timezone conversion
......................................................................
IMPALA-13627: Handle legacy Hive timezone conversion
After HIVE-12191, Hive has 2 different methods of calculating timestamp
conversion from UTC to local timezone. When Impala has
convert_legacy_hive_parquet_utc_timestamps=true, it assumes times
written by Hive are in UTC and converts them to local time using tzdata,
which matches the newer method introduced by HIVE-12191.
Some dates convert differently between the two methods, such as
Asia/Kuala_Lumpur or Singapore prior to 1982 (also seen in HIVE-24074).
After HIVE-25104, Hive writes 'writer.zone.conversion.legacy' to
distinguish which method is being used. As a result there are three
different cases we have to handle:
- Hive prior to 3.1 used what’s now called “legacy conversion” using
SimpleDateFormat.
- Hive 3.1.2 (with HIVE-21290) used a new Java API that’s based on
tzdata and added metadata to identify the timezone.
- Hive 4 support both, and added a new file metadata to identify it.
Adds handling for Hive files (identified by created_by=parquet-mr) where
we can infer the correct handling from Parquet file metadata:
- if writer.zone.conversion.legacy is present (Hive 4), use it to
determine whether to use a legacy conversion method compatible with
Hive's legacy behavior, or convert using tzdata.
- if writer.zone.conversion.legacy is not present but writer.time.zone
is, we can infer it was written by Hive 3.1.2+ using new APIs.
- otherwise it was likely written by an earlier Hive version.
Adds a new CLI and query option - use_legacy_hive_timestamp_conversion -
to select what conversion method to use when an old Hive file is
detected. Defaults to false to minimize changes in Impala's behavior and
because going through JNI is ~50x slower even when the results would not
differ; Hive defaults to true for its equivalent setting:
hive.parquet.timestamp.legacy.conversion.enabled.
Hive legacy-compatible conversion uses a Java method that would be
complicated to mimic in C++, doing
DateFormat formatter = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
formatter.setTimeZone(TimeZone.getTimeZone(timezone_string));
java.util.Date date = formatter.parse(date_time_string);
formatter.setTimeZone(TimeZone.getTimeZone("UTC"));
return out.println(formatter.format(date);
IMPALA-9385 added a check against a Timezone pointer in
FromUnixTimestamp. That dominates the time in FromUnixTimeNanos,
overriding any benchmark gains from IMPALA-7417. Moves FromUnixTime to
allow inlining, and switches to using UTCPTR in the benchmark - as
IMPALA-9385 did in most other code - to restore benchmark results.
Testing:
- Adds JVM conversion method to convert-timestamp-benchmark.
- Adds tests for several cases from Hive conversion tests.
Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
---
M be/src/benchmarks/convert-timestamp-benchmark.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-common.cc
M be/src/exec/parquet/parquet-common.h
M be/src/runtime/timestamp-value.cc
M be/src/runtime/timestamp-value.h
M be/src/runtime/timestamp-value.inline.h
M be/src/service/frontend.cc
M be/src/service/frontend.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/Frontend.thrift
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M docs/topics/impala_timestamp.xml
M fe/src/main/java/org/apache/impala/service/JniFrontend.java
A testdata/data/employee_hive_3_1_3_us_pacific.parquet
A testdata/data/hive_kuala_lumpur_legacy.parquet
A testdata/data/tbl_parq1/000000_0
A testdata/data/tbl_parq1/000000_1
A testdata/data/tbl_parq1/000000_2
A
testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-313.test
A
testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-3m.test
A
testdata/workloads/functional-query/queries/QueryTest/timestamp-conversion-hive-4.test
A tests/query_test/test_hive_timestamp_conversion.py
26 files changed, 510 insertions(+), 103 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/22293/15
--
To view, visit http://gerrit.cloudera.org:8080/22293
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067
Gerrit-Change-Number: 22293
Gerrit-PatchSet: 15
Gerrit-Owner: Michael Smith <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>