Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22293 )
Change subject: IMPALA-13627: Handle legacy Hive timezone conversion ...................................................................... Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/22293/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22293/3//COMMIT_MSG@37 PS3, Line 37: impact on performance Did you make some benchmarks? It could be useful to have an estimate of the impact, e.g. is it just minor or may ruin performance. convert-timestamp-benchmark.cc already exists for similar measurements. https://github.com/apache/impala/blob/5f4321373a3404867e170f7381db04a843aa7310/be/src/benchmarks/convert-timestamp-benchmark.cc#L380 http://gerrit.cloudera.org:8080/#/c/22293/3/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/22293/3/be/src/exec/parquet/parquet-common.h@815 PS3, Line 815: /// If timestamp t >= v before conversion, then this function converts v in such a : /// way that the same will be true after t is converted. : void ConvertMinStatToLocalTime(TimestampValue* v) const; : : /// If timestamp t <= v before conversion, then this function converts v in such a : /// way that the same will be true after t is converted. : void ConvertMaxStatToLocalTime(TimestampValue* v) const; As these are not modified, different timezone db will be used during per-row conversion and stat filtering. This can lead to incorrectly dropping some rows. The simple solution is to turn off min/max filtering for the timestamp column in case Hive legacy TZ conversion is used. Another way would be trying to do the LT/GT predicated mapping from local to utc, but this can be convoluted near DST changes. http://gerrit.cloudera.org:8080/#/c/22293/3/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/22293/3/be/src/service/impala-server.cc@362 PS3, Line 362: use_legacy_hive_timestamp_conversion I don't think that a new flag is really needed as you can override the query option in default_query_options. use_local_tz_for_unix_timestamp_conversions/convert_legacy_hive_parquet_utc_timestamps are there due to legacy reasons, as these flags were added before a query option existed. http://gerrit.cloudera.org:8080/#/c/22293/3/common/thrift/Frontend.thrift File common/thrift/Frontend.thrift: http://gerrit.cloudera.org:8080/#/c/22293/3/common/thrift/Frontend.thrift@1081 PS3, Line 1081: i64 nit: why int64? Calendar.get() return int http://gerrit.cloudera.org:8080/#/c/22293/3/tests/query_test/test_hive_timestamp_conversion.py File tests/query_test/test_hive_timestamp_conversion.py: http://gerrit.cloudera.org:8080/#/c/22293/3/tests/query_test/test_hive_timestamp_conversion.py@52 PS3, Line 52: self.client.execute( : "create table %s.t (d timestamp) stored as parquet" % unique_database) : tbl_loc = get_fs_path("/test-warehouse/%s.db/t" % unique_database) : self.filesystem_client.copy_from_local(os.environ['IMPALA_HOME'] : + "/testdata/data/hive_kuala_lumpur_legacy.parquet", tbl_loc) There are some convenience functions to do this, see create_table_and_copy_files() and create_table_from_parquet() in https://github.com/apache/impala/blob/master/tests/common/file_utils.py http://gerrit.cloudera.org:8080/#/c/22293/3/tests/query_test/test_hive_timestamp_conversion.py@62 PS3, Line 62: '1899-12-31 16:00:00', optional: I would prefer moving these tests to a .test file (and merge the 3 tests to avoid .test file spam) This is just my preference as I find sql tests more readable without the Python "noise". -- To view, visit http://gerrit.cloudera.org:8080/22293 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1271ed1da0b74366ab8315e7ec2d4ee47111e067 Gerrit-Change-Number: 22293 Gerrit-PatchSet: 3 Gerrit-Owner: Michael Smith <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Comment-Date: Sat, 11 Jan 2025 12:59:46 +0000 Gerrit-HasComments: Yes
