Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/20681 )
Change subject: [WIP] IMPALA-12322: Support converting UTC timestamps read from Kudu to local time ...................................................................... Patch Set 5: (5 comments) Thanks for your review! I indeed didn't consider the issue of daylight saving time, as the region I'm in doesn't observe it. I appreciate your reminder and will take it into consideration. http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc File be/src/exec/kudu/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exec/kudu/kudu-scanner.cc@288 PS5, Line 288: const Timezone* local_tz = state_->query_options().convert_kudu_utc_timestamps ? : state_->local_time_zone() : UTCPTR; : : KuduValue* min_value; : RETURN_IF_ERROR(CreateKuduValue(col_type, min, &min_value, local_tz)); : KUDU_RETURN_IF_ERROR(scanner_->AddConjunctPredicate( : scanner_->GetKuduTable()->NewComparisonPredicate( : col_name, KuduPredicate::ComparisonOp::GREATER_EQUAL, min_value)), : BuildErrorString("Failed to add min predicate")); : : KuduValue* max_value; : RETURN_IF_ERROR(CreateKuduValue(col_type, max, &max_value, local_tz)); > This can filter out values incorrectly during dst changes. The > issue is that two utc micro values 1 hour apart can map to the same > local timestamp, e.g. e.g. both 01:10:00 (UTC) and 02:10:00 (UTC) > may be to 00:10:00 (local). A filter like 00:05:00 < ts < 00:15:00 > would be either converted to 02:05:00 < ts < 02:15:00 or 01:05:00 < > ts < 01:15:00, and both would drop one of the values incorrectly. > > Generally we should use the smaller possible conversion for min > value and the larger one for max value if they are ambiguous. A > similar logic was implemented for Parquet stat filtering at > https://github.com/apache/impala/blob/3be9b82207ddf32ca48ea1b2afbc88d8dffad8a4/be/src/exec/parquet/parquet-common.cc#L266 > , but it works the other way, we have the UTC min/max stats from > the Parquet file, and try to get their min/max local values. I tried using a similar approach to solve this problem, and after a simple test, it seems worked fine when converting UTC ambiguities. http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc File be/src/exprs/timestamp-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/exprs/timestamp-functions-ir.cc@139 PS5, Line 139: local_time_zone > This should use time_zone_for_unix_time_conversions() instead of > local_time_zone to assume utc if use_local_tz_for_unix_timestamp_conversions > is false. > > If you need a function that behaves differently than other > functions, then it could have a special name like ToUnixMicrosForKudu I might consider removing this function later and implementing the conversion in a different way. http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc File be/src/service/impala-server.cc: http://gerrit.cloudera.org:8080/#/c/20681/5/be/src/service/impala-server.cc@371 PS5, Line 371: DEFINE_bool(convert_kudu_utc_timestamps, false, > I don't think that we need a flag if we already have a query option, as the Done http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java: http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java@573 PS5, Line 573: kuduPredicate = KuduPredicate.newComparisonPredicate(column, op, unixMicros); > This may not work correctly during DST change, see my other > comments. After testing, it is indeed problematic, and I'm working on finding a solution. For the "in list" predicate, perhaps we can add both ambiguous values for the conversion. As for binary predicates, we could handle them similar to the min_max runtime filter by using a broader range of conversion values. Do you have any suggestions regarding this? http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java File fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java: http://gerrit.cloudera.org:8080/#/c/20681/5/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java@485 PS5, Line 485: String functionName = analyzer.getQueryOptions().isConvert_kudu_utc_timestamps() ? : "to_unix_micros" : "utc_to_unix_micros"; : Expr toUnixTimeExpr = : new FunctionCallExpr(functionName, Lists.newArrayList(srcExpr)); > Bloom filters with timezone conversion are tricky - the problem is > that two different utc micro timestamps can be mapped to the same > local timestamp, e.g. in case of dst change. If we map the local > timestamp to a single utc timestamp during bloom filter creation, > then it will only match with one of the possible values that would > be converted to this timestamp, so we will incorrectly filter the > other value. The timezone conversion for Bloom filters is indeed tricky. Enabling both ambiguous values in the filter could potentially solve the problem, but it would require significant modifications. Currently, I'm considering disabling the Bloom filter for timezone conversion in Kudu and adding an additional query option to re-enable it in such cases. This would allow regions that do not observe daylight saving time to function properly. Another option is to discard the Bloom filter when ambiguous values are detected, although the extent of its impact is uncertain. Do you have any suggestions regarding this? -- To view, visit http://gerrit.cloudera.org:8080/20681 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I9a1e7a13e617cc18deef14289cf9b958588397d3 Gerrit-Change-Number: 20681 Gerrit-PatchSet: 5 Gerrit-Owner: Zihao Ye <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Wed, 15 Nov 2023 12:32:55 +0000 Gerrit-HasComments: Yes
