Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 3:

(4 comments)

Thanks for looking into the test failure! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: NumBytesInUTF8En
> Yeah, it is a little bit tricky. I normally do not consider a particular u
OK, changed to NumBytesInUTF8Encoding().
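
For readers without the patch open, here is a minimal sketch of what a helper with this name typically computes: the length of a UTF-8 sequence derived from its first byte. The function name below is hypothetical and the body is an assumption, not the code in bit-util.h.

```cpp
#include <cstdint>

// Illustrative only (hypothetical name, not the Impala implementation).
// Maps the first byte of a UTF-8 sequence to the length of that sequence.
// Continuation bytes (10xxxxxx) and invalid lead bytes are not rejected:
// they fall through to the final return and yield 2.
constexpr int NumBytesInUtf8EncodingSketch(std::uint8_t first_byte) {
  if (first_byte < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((first_byte & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((first_byte & 0xF0) == 0xF0) return 4;  // 1111xxxx
  return 2;                                   // 110xxxxx and everything else
}

static_assert(NumBytesInUtf8EncodingSketch(0x41) == 1, "ASCII is one byte");
static_assert(NumBytesInUtf8EncodingSketch(0xE4) == 3, "1110xxxx leads 3 bytes");
```

Because the sketch does not validate its input, callers that may see malformed data need a separate start-byte check.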


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
> nit. A general version would be to find the number of elements in the array
Sorry, do you mean using 24 directly?
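
The quoted suggestion is cut off, so its exact intent is unclear. One plausible reading is to derive the element count from the array itself instead of naming the element type. A small sketch of that idiom follows; the array name and contents are hypothetical stand-ins, not the values in string-util-test.cc.

```cpp
#include <cstddef>
#include <iterator>

// Hypothetical stand-in for the test's byte_lens array; values are illustrative.
constexpr int kByteLensSketch[] = {1, 2, 3, 4};

// Dividing by the size of the first element keeps the expression correct
// even if the element type later changes from int to something else.
constexpr std::size_t kCountBySizeof =
    sizeof(kByteLensSketch) / sizeof(kByteLensSketch[0]);

// C++17 alternative: std::size() does not compile when handed a pointer,
// so it also catches accidental array-to-pointer decay.
constexpr std::size_t kCountByStdSize = std::size(kByteLensSketch);

static_assert(kCountBySizeof == 4, "four elements in the sketch array");
static_assert(kCountByStdSize == kCountBySizeof, "both idioms agree");
```

Whether std::size() is usable depends on the C++ standard the build targets, which is an assumption here.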


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@99
PS2, Line 99: / Counting malformed UTF8 characters.
            :     while (!BitUtil::IsUtf8Start
> nit. May move this to the .h file.
This comment explains the while-loop that follows it. Let me mention the error 
handling in the header file.
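
To illustrate the kind of loop that comment describes, here is a rough sketch of a forward scan that treats each stray continuation byte as one malformed character. All names are hypothetical, and the counting policy (a truncated sequence counts once) is an assumption rather than the behavior of the patch.

```cpp
#include <cstdint>

// Hypothetical helper: true unless the byte is a continuation byte (10xxxxxx).
inline bool IsStartByteSketch(std::uint8_t b) { return (b & 0xC0) != 0x80; }

// Hypothetical helper: sequence length implied by a start byte.
inline int LeadByteLenSketch(std::uint8_t b) {
  if (b < 0x80) return 1;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF0) == 0xF0) return 4;
  return 2;
}

// Counts characters in ptr[0, len). Assumed policy: a start byte plus the
// continuation bytes it owns is one character (even if truncated), and every
// continuation byte without an owner is one malformed character.
inline int CountUtf8CharsLenient(const std::uint8_t* ptr, int len) {
  int count = 0;
  int pos = 0;
  while (pos < len) {
    if (!IsStartByteSketch(ptr[pos])) {
      // Stray continuation byte: count it as one malformed character.
      ++count;
      ++pos;
      continue;
    }
    const int expected = LeadByteLenSketch(ptr[pos]);
    int consumed = 1;
    ++pos;
    // Consume only the continuation bytes that belong to this character.
    while (consumed < expected && pos < len && !IsStartByteSketch(ptr[pos])) {
      ++consumed;
      ++pos;
    }
    ++count;  // Complete or truncated, the sequence counts once.
  }
  return count;
}
```

With this policy, the bytes 0x61 0xC3 0xA9 0x80 count as three characters: 'a', a two-byte character, and one malformed byte.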


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: nt bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     int malformed_bytes = last_pos - pos - bytes_len;
> nit. We probably should check illegal utf8 characters here too.
OK. I was thinking that adding too much complexity here would block future 
optimizations, so I had planned to add the error handling in IMPALA-10761. But 
sure, let's make it right and consistent with FindUtf8PosForward() first.
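
One case the quoted expression does not cover on its own: if the character starting at pos is truncated, last_pos - pos - bytes_len is negative. A minimal sketch of one way to guard against that, assuming a clamp to zero is acceptable. The helper names are hypothetical and the actual fix in the patch may differ.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: sequence length implied by a start byte.
inline int Utf8LenFromFirstByte(std::uint8_t b) {
  if (b < 0x80) return 1;
  if ((b & 0xF0) == 0xE0) return 3;
  if ((b & 0xF0) == 0xF0) return 4;
  return 2;
}

// ptr[pos] is assumed to be a start byte, and last_pos is one past the last
// byte not yet attributed to a character. Returns how many bytes after the
// character at pos are left over, never less than zero.
inline int TrailingMalformedBytes(const std::uint8_t* ptr, int pos, int last_pos) {
  const int bytes_len = Utf8LenFromFirstByte(ptr[pos]);
  // Without the clamp, a truncated character (fewer bytes available than the
  // start byte announces) would produce a negative count.
  return std::max(last_pos - pos - bytes_len, 0);
}
```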



-- 
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 23 Jun 2021 07:48:31 +0000
Gerrit-HasComments: Yes
