Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................


Patch Set 17:

(23 comments)

Resolved most of the comments.

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc@560
PS14, Line 560: urn bool_type();
> We probably should treat this immediate type the same as CHAR.
Yeah, now they are the same again (using 'byte_len').


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@229
PS14, Line 229: sv->len = FindUtf8PosForward(sv->ptr, len.val, ma
> The UTF8 validity test should be performed here.
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@252
PS14, Line 252: FindUtf8PosForward(*data, len.val, max_char_len);
> Same as above. UTF8 validity test here.
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc
File be/src/exec/hdfs-avro-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc@642
PS16, Line 642:           success = ReadAvroVarchar(slot_type, 
state_->utf8_mode(),
> line too long (95 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-avro-scanner.cc@645
PS16, Line 645:           success = ReadAvroChar(slot_type, state_->utf8_mode(),
> line too long (92 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/hdfs-orc-scanner.cc@1152
PS16, Line 1152:         dst_ptr = reinterpret_cast<char*>(
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc@404
PS14, Line 404: max_chars_num << ") value!";
> UTF8 validity test should be added here.
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@209
PS14, Line 209:      row_idx, scratch_batch, scratch_batch_idx);
> Same comment as about the decode() method: UTF8 validity test should be don
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@223
PS14, Line 223: ch (slot_desc_->type().type) {
> Same comment as above (utf8 validity).
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/16909/16/be/src/exec/orc-column-readers.cc@304
PS16, Line 304:     StringValue::PadWithSpaces(dst_char, 
slot_desc_->type().byte_len, unpadded_len);
> line too long (91 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@513
PS14, Line 513: uble* v) {
> I think some type of check on UTF8 validity of the buffer content should be
Let's handle this in IMPALA-10761 since it's not trivial.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h
File be/src/exec/text-converter.inline.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@64
PS14, Line 64: type.type == TYPE_VARCHAR
> I thought GetMaxStrLen() is defined only for TYPE_CHAR.
Changed to use 'byte_len' to make this more clear.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@272
PS14, Line 272: max_char_cn
> It seems we need to call FindUtf8Pos() only for the case of truncation cast
For padding cast, we need to know how many characters are in this string, i.e. 
'utf8_cnt', so we can decide the # of padding spaces.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@279
PS14, Line 279:
> The name of the variable seems not in agreement with the unit computed from
We are padding spaces each is encoded into one byte. So we don't need to 
multiply 4 here. Added a comment here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc
File be/src/exprs/slot-ref.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@249
PS14, Line 249:
> I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no?
yes, it's irrelevant. I change 'len' to 'byte_len' to make it more clear.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@393
PS14, Line 393: )
> I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no?
yes, it's irrelevant. I change 'len' to 'byte_len' to make it more clear.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@52
PS14, Line 52: f8 = scalar_type.__isset.is_utf8 &&
> Consider make a special case for TYPE_FIXED_UDA_INTEMEDIATE which cares onl
Done


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@333
PS14, Line 333:
> Probably should be STRING(UTF8).
I think "STRING UTF8" is better since it's consistent with "CHAR(N) UTF8" and 
"VARCHAR(N) UTF8"


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h
File be/src/udf/udf-internal.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h@148
PS14, Line 148: _
> May add a comment "to return storage size".
Done. Renamed this to RETURN_TYPE_BYTE_SIZE


http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift
File common/thrift/Types.thrift:

http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift@66
PS14, Line 66: . CHAR(3) has len=3
> May add: len is the number of characters for CHAR or VARCHAR.
Done


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java@296
PS14, Line 296:
> It probably is a good idea to move this logic to ScalarType where CHAR data
This is now in SlotDescriptor#getMaterializedSlotSize(). I think it's ok to be 
not in ScalarType.java since the Type instances are shared across the metadata 
and don't contain the state of isUtf8Mode().


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java
File fe/src/main/java/org/apache/impala/catalog/ScalarType.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java@42
PS14, Line 42: number of characters, i.
> better reworded as "by number of characters, for CHAR and VCHAR".
Done. Added 'byte_len' in BE.


http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
File 
testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test:

http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test@14
PS14, Line 14:
> Consider some other cast examples:  char-> non-char.
hmm, I'm not quite understand this. I guess '?' above is a unicode character 
and got escaped by Gerrit..
Added a test like cast(cast('1234' as char(2)) as int)



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Wed, 25 May 2022 13:57:57 +0000
Gerrit-HasComments: Yes

Reply via email to