Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc@269
PS5, Line 269: CountUtf8Chars
This can return different values than the original implementation of 
Utf8Length if there are malformed characters - is this intentional? E.g. two 
consecutive start bytes where NumBytesInUTF8Encoding==2 (without the following 
continuation bytes) would return 2 in the original but 1 in the new 
implementation.

I don't know what the correct behavior is in this case, but the original 
version seems slightly safer.
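
To make the difference concrete, here is a minimal sketch of the two counting 
strategies. It is not the actual Impala code: SeqLenFromLeadByte is a 
hypothetical stand-in for BitUtil::NumBytesInUTF8Encoding, and I am assuming 
that the original Utf8Length counts non-continuation bytes, which is only a 
guess that matches the numbers above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for BitUtil::NumBytesInUTF8Encoding(): derives the
// sequence length from the lead byte only. Returning 1 for a stray
// continuation or invalid byte is an assumption of this sketch.
inline int SeqLenFromLeadByte(uint8_t b) {
  if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                          // continuation or invalid byte
}

// Strategy A (assumed shape of the original): count every byte that is not
// a continuation byte (10xxxxxx).
inline int CountNonContinuationBytes(const uint8_t* ptr, size_t len) {
  assert(ptr != nullptr || len == 0);
  int cnt = 0;
  for (size_t i = 0; i < len; ++i) {
    if ((ptr[i] & 0xC0) != 0x80) ++cnt;
  }
  return cnt;
}

// Strategy B (assumed shape of the new code): count one character per lead
// byte, then skip the length that the lead byte declares.
inline int CountBySkipping(const uint8_t* ptr, size_t len) {
  assert(ptr != nullptr || len == 0);
  int cnt = 0;
  size_t pos = 0;
  while (pos < len) {
    ++cnt;
    pos += static_cast<size_t>(SeqLenFromLeadByte(ptr[pos]));
  }
  return cnt;
}
```

For the two-byte input {0xC3, 0xC3}, strategy A counts 2 and strategy B 
counts 1, because B skips over the second start byte as if it were a 
continuation byte. On well-formed input the two agree.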


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
I realized that the issue mentioned in line 120 also exists here: if 
NumBytesInUTF8Encoding() returns more bytes than remain in the string, then 
pos can point past the end of the array.
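
One possible way to guard against this is to clamp each step to the bytes 
that remain. The sketch below is only an illustration under assumptions: 
ByteOffsetOfChar and SeqLenFromLeadByte are hypothetical names (the latter 
standing in for BitUtil::NumBytesInUTF8Encoding), not functions from the 
patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for BitUtil::NumBytesInUTF8Encoding(); returning 1
// for a stray continuation or invalid byte is an assumption of this sketch.
inline int SeqLenFromLeadByte(uint8_t b) {
  if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 1;                          // continuation or invalid byte
}

// Returns the byte offset of the character with 0-based index `char_idx`,
// or `len` if the string has fewer characters. A truncated trailing
// sequence is treated as one character ending at `len`.
inline size_t ByteOffsetOfChar(const uint8_t* ptr, size_t len, size_t char_idx) {
  assert(ptr != nullptr || len == 0);
  size_t pos = 0;
  for (size_t i = 0; i < char_idx && pos < len; ++i) {
    size_t step = static_cast<size_t>(SeqLenFromLeadByte(ptr[pos]));
    // Clamp so that pos never moves past the end of the buffer.
    pos += std::min(step, len - pos);
  }
  return pos;
}
```

With the truncated input {'a', 0xE4}, asking for character index 2 yields 
offset 2 (the end of the buffer), whereas an unclamped step would move pos to 
4. Whether clamping or reporting an error is the right behavior here is a 
separate question.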



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Mon, 12 Jul 2021 12:43:48 +0000
Gerrit-HasComments: Yes
