Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12247 )
Change subject: IMPALA-5051: Add INT64 timestamp write support in Parquet
......................................................................

Patch Set 8:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: without conversion to UTC
> No, "pure" TIMESTAMP and TIMESTAMP WITHOUT TIME ZONE shall not be normalize
The main problem with normalizing to UTC, from Impala's standpoint, is the performance cost of the UTC->localtime conversion when reading the data back. Doing the timezone conversion for every timestamp in every row is very costly compared to the other work done during Parquet scanning.

http://gerrit.cloudera.org:8080/#/c/12247/8//COMMIT_MSG@39
PS8, Line 39: tha
> nit: the
Done

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/hdfs-parquet-table-writer.cc@579
PS8, Line 579: result_
> What about deleting the member 'result_', and only have it here as a local
result_ shouldn't be a local variable, as we return a pointer to it. This is the interface expected by BaseColumnWriter::AppendRow(): it has no template parameter for the column's type, so it handles values through a void* that points to the value to be inserted into the current row, and that value is expected to stay alive until we step to the next row.
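To illustrate the lifetime argument above, here is a minimal sketch (the class and method names are invented for illustration; this is not Impala's actual writer code). The converted value must be backed by member storage because the caller holds the returned void* after the conversion call returns; a stack local would dangle.

```cpp
#include <cstdint>

// Illustrative stand-in for a type-erased column writer: ConvertValue()
// returns a void* that the caller may dereference until the next row is
// processed, so the pointed-to value must be a member, not a local.
class ColumnWriterSketch {
 public:
  // Converts the row's value (here: microseconds -> nanoseconds) and
  // returns a pointer that remains valid until the next ConvertValue call.
  const void* ConvertValue(int64_t micros) {
    result_ = micros * 1000;  // Stored in the member, so it outlives the call.
    return &result_;
  }

 private:
  int64_t result_;  // Member storage backing the returned pointer.
};
```

Had result_ been a local inside ConvertValue, the returned pointer would reference destroyed stack memory by the time AppendRow-style callers read it.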
http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h
File be/src/exec/parquet/parquet-metadata-utils.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.h@60
PS8, Line 60: Return
> nit: Returns
Done

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc
File be/src/exec/parquet/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/exec/parquet/parquet-metadata-utils.cc@142
PS8, Line 142: /// converted_type is not set because Impala always writes timestamps without UTC
> Parquet supports both UTC-normalized and timezone-agnostic timestamps, aka
As far as I know, Parquet-MR does not do any timezone conversion and leaves this task to the caller, e.g. Hive.

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h
File be/src/runtime/timestamp-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/12247/8/be/src/runtime/timestamp-value.inline.h@154
PS8, Line 154: kudu::int128_t nanos128 =
            :     static_cast<kudu::int128_t>(unixtime_seconds) * NANOS_PER_SEC
            :     + time_.fractional_seconds();
            :
            : if (nanos128 < std::numeric_limits<int64_t>::min()
            :     || nanos128 > std::numeric_limits<int64_t>::max()) return false;
> I think we can still avoid using int128_t.
I created a ticket for benchmarking and optimizing these new functions: IMPALA-8268. Performance is not too important at the moment, as parquet_timestamp_type is a development query option and should mainly be used to test whether the new timestamps can be read by other Hadoop components.

I think the most costly parts of this function are time_.fractional_seconds() and UtcToUnixTime()'s time_.total_seconds(), as both need int64 integer division. These could be avoided by not using UtcToUnixTime() and converting to nanoseconds from day_ + time_ directly, but that has to be done carefully near the edge values.
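For reference, the overflow check being discussed can be sketched as below. This is a simplified standalone version, not Impala's actual code: GCC/Clang's __int128 stands in for kudu::int128_t, and the function name and parameters are illustrative. The point is that seconds * NANOS_PER_SEC can exceed the int64 range, so the multiplication is done in 128 bits and the result is range-checked before narrowing.

```cpp
#include <cstdint>
#include <limits>

// Converts a Unix time in seconds plus fractional nanoseconds to a single
// int64 nanosecond count. Returns false if the result would not fit in
// int64 (the representable range is roughly years 1677..2262).
bool SecondsToNanos(int64_t unixtime_seconds, int32_t fractional_nanos,
                    int64_t* out_nanos) {
  constexpr int64_t NANOS_PER_SEC = 1000LL * 1000 * 1000;
  // 128-bit intermediate: int64 seconds * 1e9 can overflow 64 bits.
  __int128 nanos128 =
      static_cast<__int128>(unixtime_seconds) * NANOS_PER_SEC + fractional_nanos;
  if (nanos128 < std::numeric_limits<int64_t>::min() ||
      nanos128 > std::numeric_limits<int64_t>::max()) {
    return false;  // Out of the INT64 nanosecond range.
  }
  *out_nanos = static_cast<int64_t>(nanos128);
  return true;
}
```

An int128-free variant would have to pre-check unixtime_seconds against INT64_MIN/NANOS_PER_SEC and INT64_MAX/NANOS_PER_SEC boundaries before multiplying, which is exactly the kind of edge-value care mentioned above.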
http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test:

http://gerrit.cloudera.org:8080/#/c/12247/8/testdata/workloads/functional-query/queries/QueryTest/parquet-int64-timestamps.test@102
PS8, Line 102: ---- QUERY
            : create table int96_nanos (ts timestamp) stored as parquet;
            : ====
            : ---- QUERY
            : # Insert edge values as "normal" int96 timestamps that can represent all values.
            : set parquet_timestamp_type=INT96_NANOS;
            : insert into int96_nanos values
> nit: you dont't need to start a new QUERY block for each query when you don
Thanks for the tip!


--
To view, visit http://gerrit.cloudera.org:8080/12247
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib41ad532ec902ed5a9a1528513726eac1c11441f
Gerrit-Change-Number: 12247
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-Comment-Date: Fri, 01 Mar 2019 13:35:32 +0000
Gerrit-HasComments: Yes
