Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................


Patch Set 14:

(16 comments)

Looks good.

I think we should maintain the utf8 sound-ness property within impala whenever 
a piece of data is introduced into the engine via: literals in query, 
operations (such as concat or substring), read from a table etc.
This can be achieved via a UTF8 validity test.

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc
File be/src/codegen/llvm-codegen.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/codegen/llvm-codegen.cc@560
PS14, Line 560: TYPE_FIXED_UDA_INTERMEDIATE
We probably should treat this immediate type the same as CHAR.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc
File be/src/exec/hdfs-avro-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@229
PS14, Line 229: sv->len = FindUtf8Pos(sv->ptr, len.val, max_len);
The UTF8 validity test should be performed here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/hdfs-avro-scanner-ir.cc@252
PS14, Line 252: FindUtf8Pos(reinterpret_cast<char*>(*data), len.val, max_len);
Same as above. UTF8 validity test here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-scanner.cc@404
PS14, Line 404: "Kudu returns longer length in VARCHAR
UTF8 validity test should be added here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc@150
PS14, Line 150: , is_utf8
This is another example on utf8 mode. If utf8 is a property of the data which 
should be, then we will not need argument.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@209
PS14, Line 209:  unpadded_len = FindUtf8Pos(src_ptr, src_len, char_len);
Same comment as about the decode() method: UTF8 validity test should be done 
here to catch non-UTF8 data early.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/orc-column-readers.cc@223
PS14, Line 223: dst->len = FindUtf8Pos(src_ptr, src_len, char_len);
Same comment as above (utf8 validity).


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@222
PS14, Line 222: is_utf8,
For non BYTE_ARRAY type, it is better to supply 'false' directly here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@513
PS14, Line 513: // In UTF8 mode, 'fixed_len_size' is 4N for Char(N)/Varchar(N) 
slots
I think some type of check on UTF8 validity of the buffer content should be 
done here and report error as needed. This will catch a piece of binary data 
that is not in UTF8 and the UTF8 interpretation is requested.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h
File be/src/exec/text-converter.inline.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@64
PS14, Line 64: type.type == TYPE_VARCHAR
I thought GetMaxStrLen() is defined only for TYPE_CHAR.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@100
PS14, Line 100:  if (type.type == TYPE_CHAR) {
              :         // Truncates the string if in UTF-8 mode.
              :         if (type.is_utf8) str.len = FindUtf8Pos(str.ptr, 
str.len, type.len);
              :         StringValue::PadWithSpaces(str.ptr, buffer_len, 
str.len);
              :         str.len = buffer_len;
              :       } else if (type.type == TYPE_VARCHAR && type.is_utf8) {
              :         // Truncates the string if in UTF-8 mode.
              :         str.len = FindUtf8Pos(str.ptr, str.len, type.len);
              :       }
Can we reuse CastFunctions::CastToChar(FunctionContext* ctx, const StringVal& 
val) here? It would be nice to avoid code duplication.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc
File be/src/exprs/agg-fn-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc@284
PS14, Line 284: intermediate type
Good question! From the comment below, the immediate type can be created for 
MIN(<char column in utf8>). I think we should definitely consider 
TYPE_FIXED_UDA_INTERMEDIATE to be utf8.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@166
PS14, Line 166: false>
It would be nice if we do not pay the new check on UTF8 mode here. That is, we 
keep the original code?


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@192
PS14, Line 192: lse>(
Same comment as above. That is, we keep the original code.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@272
PS14, Line 272: FindUtf8Pos
It seems we need to call FindUtf8Pos() only for the case of truncation cast. 
For padding cast, we just need to add extra # of single-character space 
characters.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/cast-functions-ir.cc@279
PS14, Line 279: res_bytes_len
The name of the variable seems not in agreement with the unit computed from the 
RHS. Should it be res_bytes_len += (max_char_cnt - utf8_cnt) *4?



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Tue, 23 Feb 2021 16:54:09 +0000
Gerrit-HasComments: Yes

Reply via email to