Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 )
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types ...................................................................... Patch Set 18: (8 comments) Resolved the remaining comments. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc File be/src/exec/kudu-util.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc@150 PS14, Line 150: EAN); > This is another example on utf8 mode. If utf8 is a property of the data whi Ack http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@222 PS14, Line 222: 967 > For non BYTE_ARRAY type, it is better to supply 'false' directly here. Done http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h File be/src/exec/text-converter.inline.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@100 PS14, Line 100: if (type.type == TYPE_CHAR) { : // Truncates the string if in UTF-8 mode. : if (type.is_utf8) str.len = FindUtf8PosForward(str.ptr, str.len, type.char_len); : StringValue::PadWithSpaces(str.ptr, buffer_len, str.len); : str.len = buffer_len; : } else if (type.type == TYPE_VARCHAR && type.is_utf8) { : // Truncates the string if in UTF-8 mode. : str.len = FindUtf8PosForward(str.ptr, str.len, type.char_len); : } > Can we reuse CastFunctions::CastToChar(FunctionContext* ctx, const StringVa CastFunctions::CastToChar() is the implementation of CAST(x as CHAR(N)). It's more complex than this case. Here we are materializing a CHAR/VARCHAR slot. For a CHAR slot, its size is always char_len * 4, i.e. 'buffer_len'. We can simply fill the remaining bytes with whitespaces. However, in CastFunctions::CastToChar(), we are returning a StringVal which has variable length. We need to calculate the actual characters in the CHAR(N) string, and then append the accurate amount of missing spaces. So it's hard to reuse CastFunctions::CastToChar() here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc File be/src/exprs/agg-fn-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc@284 PS14, Line 284: intermediate type > Good question! From the comment below, the immediate type can be created fo I think we can consider that if we do have a new agg function needs it in the future. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h File be/src/udf/udf.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h@106 PS14, Line 106: int > Maybe indicate it is the length in characters. Changed this to char_len since we don't need storage_len/byte_len in it. http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2813 PS14, Line 2813: titute(nullS > This probably can create confusions as the interpretation of the column cha As discussed offline, column/table property on character set is not supported even in Hive. It also introduces the burdon of upgrading existing tables and modifying existing workloads/queries at the same time. Let's target on it in the future. Feel free to file a JIRA for it. http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java File fe/src/main/java/org/apache/impala/analysis/CastExpr.java: http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@65 PS14, Line 65: isUtf8_ = e.isUtf8_; > Should we check the targetType here too? For example, for CAST(c as INT), i If the target type is not a string type, the isUtf8 marker won't be used. So I think it's ok to leave it as-is. http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@105 PS14, Line 105: 8_ = e.isU > Same comment as above. Same reply as above. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 18 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Sat, 28 May 2022 10:34:45 +0000 Gerrit-HasComments: Yes
