Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 )
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types ...................................................................... Patch Set 17: (23 comments) Resolved most of the comments. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc File be/src/codegen/llvm-codegen.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc@560 PS14, Line 560: urn bool_type(); > We probably should treat this immediate type the same as CHAR. Yeah, now they are the same again (using 'byte_len'). http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc File be/src/exec/hdfs-avro-scanner-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@229 PS14, Line 229: sv->len = FindUtf8PosForward(sv->ptr, len.val, ma > The UTF8 validity test should be performed here. Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@252 PS14, Line 252: FindUtf8PosForward(*data, len.val, max_char_len); > Same as above. UTF8 validity test here. Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc File be/src/exec/hdfs-avro-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc@642 PS16, Line 642: success = ReadAvroVarchar(slot_type, state_->utf8_mode(), > line too long (95 > 90) Done http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc@645 PS16, Line 645: success = ReadAvroChar(slot_type, state_->utf8_mode(), > line too long (92 > 90) Done http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-orc-scanner.cc File be/src/exec/hdfs-orc-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-orc-scanner.cc@1152 PS16, Line 1152: dst_ptr = reinterpret_cast<char*>( > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc@404 PS14, Line 404: max_chars_num << ") value!"; > UTF8 validity test should be added here. Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@209 PS14, Line 209: row_idx, scratch_batch, scratch_batch_idx); > Same comment as about the decode() method: UTF8 validity test should be don Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@223 PS14, Line 223: ch (slot_desc_->type().type) { > Same comment as above (utf8 validity). Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/orc-column-readers.cc File be/src/exec/orc-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/orc-column-readers.cc@304 PS16, Line 304: StringValue::PadWithSpaces(dst_char, slot_desc_->type().byte_len, unpadded_len); > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@513 PS14, Line 513: uble* v) { > I think some type of check on UTF8 validity of the buffer content should be Let's handle this in IMPALA-10761 since it's not trivial. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h File be/src/exec/text-converter.inline.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@64 PS14, Line 64: type.type == TYPE_VARCHAR > I thought GetMaxStrLen() is defined only for TYPE_CHAR. Changed to use 'byte_len' to make this more clear. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@272 PS14, Line 272: max_char_cn > It seems we need to call FindUtf8Pos() only for the case of truncation cast For padding cast, we need to know how many characters are in this string, i.e. 'utf8_cnt', so we can decide the # of padding spaces. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@279 PS14, Line 279: > The name of the variable seems not in agreement with the unit computed from We are padding spaces each is encoded into one byte. So we don't need to multiply 4 here. Added a comment here. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc File be/src/exprs/slot-ref.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@249 PS14, Line 249: > I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no? yes, it's irrelevant. I change 'len' to 'byte_len' to make it more clear. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@393 PS14, Line 393: ) > I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no? yes, it's irrelevant. I change 'len' to 'byte_len' to make it more clear. http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc File be/src/runtime/types.cc: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@52 PS14, Line 52: f8 = scalar_type.__isset.is_utf8 && > Consider make a special case for TYPE_FIXED_UDA_INTEMEDIATE which cares onl Done http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@333 PS14, Line 333: > Probably should be STRING(UTF8). I think "STRING UTF8" is better since it's consistent with "CHAR(N) UTF8" and "VARCHAR(N) UTF8" http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h File be/src/udf/udf-internal.h: http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h@148 PS14, Line 148: _ > May add a comment "to return storage size". Done. Renamed this to RETURN_TYPE_BYTE_SIZE http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift File common/thrift/Types.thrift: http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift@66 PS14, Line 66: . CHAR(3) has len=3 > May add: len is the number of characters for CHAR or VARCHAR. Done http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java: http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java@296 PS14, Line 296: > It probably is a good idea to move this logic to ScalarType where CHAR data This is now in SlotDescriptor#getMaterializedSlotSize(). I think it's ok to be not in ScalarType.java since the Type instances are shared across the metadata and don't contain the state of isUtf8Mode(). http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java File fe/src/main/java/org/apache/impala/catalog/ScalarType.java: http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java@42 PS14, Line 42: number of characters, i. > better reworded as "by number of characters, for CHAR and VCHAR". Done. Added 'byte_len' in BE. http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test File testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test: http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test@14 PS14, Line 14: > Consider some other cast examples: char-> non-char. hmm, I'm not quite understand this. I guess '?' above is a unicode character and got escaped by Gerrit.. Added a test like cast(cast('1234' as char(2)) as int) -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 17 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Wed, 25 May 2022 13:57:57 +0000 Gerrit-HasComments: Yes
