Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/20681 )
Change subject: [WIP] IMPALA-12322: Support converting UTC timestamps read from Kudu to local time ...................................................................... Patch Set 5: (5 comments) Left a few comments mainly about DST change troubles, as the patch doesn't seem to be prepared for those. Generally reading the values from Kudu and converting them to local time should work, but handling predicates/runtime filters correctly can be very tricky. http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc File be/src/exec/kudu/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc@288 PS5, Line 288: const Timezone* local_tz = state_->query_options().convert_kudu_utc_timestamps ? : state_->local_time_zone() : UTCPTR; : : KuduValue* min_value; : RETURN_IF_ERROR(CreateKuduValue(col_type, min, &min_value, local_tz)); : KUDU_RETURN_IF_ERROR(scanner_->AddConjunctPredicate( : scanner_->GetKuduTable()->NewComparisonPredicate( : col_name, KuduPredicate::ComparisonOp::GREATER_EQUAL, min_value)), : BuildErrorString("Failed to add min predicate")); : : KuduValue* max_value; : RETURN_IF_ERROR(CreateKuduValue(col_type, max, &max_value, local_tz)); This can filter out values incorrectly during dst changes. The issue is that two utc micro values 1 hour apart can map to the same local timestamp, e.g. e.g. both 01:10:00 (UTC) and 02:10:00 (UTC) may be to 00:10:00 (local). A filter like 00:05:00 < ts < 00:15:00 would be either converted to 02:05:00 < ts < 02:15:00 or 01:05:00 < ts < 01:15:00, and both would drop one of the values incorrectly. Generally we should use the smaller possible conversion for min value and the larger one for max value if they are ambiguous. A similar logic was implemented for Parquet stat filtering at https://github.com/apache/impala/blob/3be9b82207ddf32ca48ea1b2afbc88d8dffad8a4/be/src/exec/parquet/parquet-common.cc#L266 , but it works the other way, we have the UTC min/max stats from the Parquet file, and try to get their min/max local values. http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc@139 PS5, Line 139: local_time_zone This should use time_zone_for_unix_time_conversions() instead of local_time_zone to assume utc if use_local_tz_for_unix_timestamp_conversions is false. If you need a function that behaves differently than other functions, then it could have a special name like ToUnixMicrosForKudu http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc@371 PS5, Line 371: DEFINE_bool(convert_kudu_utc_timestamps, false, I don't think that we need a flag if we already have a query option, as the default of the query option can be set with flag default_query_options. convert_legacy_hive_parquet_utc_timestamps flag was kept for backwards compatibility, as it existed before the query option. http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java: http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java@573 PS5, Line 573: kuduPredicate = KuduPredicate.newComparisonPredicate(column, op, unixMicros); This may not work correctly during DST change, see my other comments. http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@485 PS5, Line 485: String functionName = analyzer.getQueryOptions().isConvert_kudu_utc_timestamps() ? : "to_unix_micros" : "utc_to_unix_micros"; : Expr toUnixTimeExpr = : new FunctionCallExpr(functionName, Lists.newArrayList(srcExpr)); Bloom filters with timezone conversion are tricky - the problem is that two different utc micro timestamps can be mapped to the same local timestamp, e.g. in case of dst change. If we map the local timestamp to a single utc timestamp during bloom filter creation, then it will only match with one of the possible values that would be converted to this timestamp, so we will incorrectly filter the other value. -- To view, visit http://gerrit.cloudera.org:8080/20681 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3 Gerrit-Change-Number: 20681 Gerrit-PatchSet: 5 Gerrit-Owner: Zihao Ye <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Mon, 13 Nov 2023 18:33:23 +0000 Gerrit-HasComments: Yes
