Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12247 )

Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................


Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalize
The main problem with normalizing to UTC, from Impala's standpoint, is the 
performance cost of the UTC->localtime conversion when reading the data back. 
Doing the timezone conversion for every timestamp in every row is very costly 
compared to the other work done during Parquet scanning.


http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
> nit: the
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> What about deleting the member 'result_', and only have it here as a local
result_ shouldn't be a local variable, as we return a pointer to it. This is 
the interface expected by BaseColumnWriter::AppendRow(): it is not templated 
on the column type, so it handles values through a void* that points to the 
value to be inserted into the current row, and that value is expected to live 
until we step to the next row.
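To illustrate, a minimal standalone sketch of that interface shape (hypothetical 
names, not Impala's actual classes): the caller only receives a void*, so the 
pointed-to value must outlive the conversion call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a typed column writer behind a type-erased
// interface. The returned void* must stay valid until the next row, which
// is why the converted value lives in a member rather than a local.
class Int64ColumnWriterSketch {
 public:
  // Returns a pointer to the converted value; valid until the next call.
  const void* ConvertValue(int64_t input) {
    result_ = input * 2;  // stand-in for a real conversion
    return &result_;
  }

 private:
  int64_t result_;  // member, not a local: the returned pointer must outlive the call
};
```

If result_ were a local, the returned pointer would dangle as soon as 
ConvertValue() returned.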


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
> nit: Returns
Done


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka
As far as I know, Parquet-MR does not do any timezone conversion and leaves 
this task to the caller, e.g. Hive.


http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
             :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
             :     + time_.fractional_seconds();
             :
             :   if (nanos128 < std::numeric_limits<int64_t>::min()
             :       || nanos128 > std::numeric_limits<int64_t>::max()) return false;
> I think we can still avoid using int128_t.
I created a ticket for benchmarking and optimizing these new functions: 
IMPALA-8268. Performance is not too important at the moment, as 
parquet_timestamp_type is a development query option and should mainly be used 
to test whether the new timestamps can be read by other Hadoop components.

I think the most costly operations in this function are 
time_.fractional_seconds() and UtcToUnixTime()'s time_.total_seconds(), as 
both need int64 integer division. These could be avoided by skipping 
UtcToUnixTime and converting to nanoseconds directly from day_ + time_, but 
this has to be done carefully near the edge values.
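For reference, the range check in the quoted snippet amounts to something like 
this standalone sketch (using the compiler's __int128 in place of 
kudu::int128_t, and a hypothetical function name; not the actual Impala code):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

constexpr int64_t kNanosPerSec = 1000000000LL;

// Converts seconds + fractional nanoseconds since the Unix epoch to a single
// int64 nanosecond count, or returns nullopt if the result would overflow.
// The multiplication is done in 128 bits so the overflow check itself is safe.
std::optional<int64_t> ToUnixNanos(int64_t unixtime_seconds,
                                   int64_t fractional_nanos) {
  __int128 nanos128 =
      static_cast<__int128>(unixtime_seconds) * kNanosPerSec + fractional_nanos;
  if (nanos128 < std::numeric_limits<int64_t>::min() ||
      nanos128 > std::numeric_limits<int64_t>::max()) {
    return std::nullopt;  // outside the representable int64 nanosecond range
  }
  return static_cast<int64_t>(nanos128);
}
```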


http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102: ---- QUERY
             : create table int96_nanos (ts timestamp) stored as parquet;
             : ====
             : ---- QUERY
             : # Insert edge values as "normal" int96 timestamps that can represent all values.
             : set parquet_timestamp_type=INT96_NANOS;
             : insert into int96_nanos values
> nit: you dont't need to start a new QUERY block for each query when you don
Thanks for the tip!



--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:35:32 +0000
Gerrit-HasComments: Yes
