Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 )
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types ...................................................................... Patch Set 14: (16 comments) Looks good. I think we should maintain the utf8 sound-ness property within impala whenever a piece of data is introduced into the engine via: literals in query, operations (such as concat or substring), read from a table etc. This can be achieved via a UTF8 validity test. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc@560 PS14, Line 560: TYPE_FIXED_UDA_INTERMEDIATE We probably should treat this immediate type the same as CHAR. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@229 PS14, Line 229: sv->len = FindUtf8Pos(sv->ptr, len.val, max_len); The UTF8 validity test should be performed here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@252 PS14, Line 252: FindUtf8Pos(reinterpret_cast<char*>(*data), len.val, max_len); Same as above. UTF8 validity test here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc@404 PS14, Line 404: "Kudu returns longer length in VARCHAR UTF8 validity test should be added here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc File be/src/exec/kudu-util.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc@150 PS14, Line 150: , is_utf8 This is another example on utf8 mode. If utf8 is a property of the data which should be, then we will not need argument. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@209 PS14, Line 209: unpadded_len = FindUtf8Pos(src_ptr, src_len, char_len); Same comment as about the decode() method: UTF8 validity test should be done here to catch non-UTF8 data early. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@223 PS14, Line 223: dst->len = FindUtf8Pos(src_ptr, src_len, char_len); Same comment as above (utf8 validity). http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@222 PS14, Line 222: is_utf8, For non BYTE_ARRAY type, it is better to supply 'false' directly here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@513 PS14, Line 513: // In UTF8 mode, 'fixed_len_size' is 4N for Char(N)/Varchar(N) slots I think some type of check on UTF8 validity of the buffer content should be done here and report error as needed. This will catch a piece of binary data that is not in UTF8 and the UTF8 interpretation is requested. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h File be/src/exec/text-converter.inline.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@64 PS14, Line 64: type.type == TYPE_VARCHAR I thought GetMaxStrLen() is defined only for TYPE_CHAR. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@100 PS14, Line 100: if (type.type == TYPE_CHAR) { : // Truncates the string if in UTF-8 mode. : if (type.is_utf8) str.len = FindUtf8Pos(str.ptr, str.len, type.len); : StringValue::PadWithSpaces(str.ptr, buffer_len, str.len); : str.len = buffer_len; : } else if (type.type == TYPE_VARCHAR && type.is_utf8) { : // Truncates the string if in UTF-8 mode. : str.len = FindUtf8Pos(str.ptr, str.len, type.len); : } Can we reuse CastFunctions::CastToChar(FunctionContext* ctx, const StringVal& val) here? It would be nice to avoid code duplication. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc File be/src/exprs/agg-fn-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc@284 PS14, Line 284: intermediate type Good question! From the comment below, the immediate type can be created for MIN(<char column in utf8>). I think we should definitely consider TYPE_FIXED_UDA_INTERMEDIATE to be utf8. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@166 PS14, Line 166: false> It would be nice if we do not pay the new check on UTF8 mode here. That is, we keep the original code? http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@192 PS14, Line 192: lse>( Same comment as above. That is, we keep the original code. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@272 PS14, Line 272: FindUtf8Pos It seems we need to call FindUtf8Pos() only for the case of truncation cast. For padding cast, we just need to add extra # of single-character space characters. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@279 PS14, Line 279: res_bytes_len The name of the variable seems not in agreement with the unit computed from the RHS. Should it be res_bytes_len += (max_char_cnt - utf8_cnt) *4? -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 14 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Tue, 23 Feb 2021 16:54:09 +0000 Gerrit-HasComments: Yes
