Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 3:

(6 comments)

Looks very good! Thanks a lot.

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
> Sorry, do you mean using 24 directly?
Never mind. sizeof(byte_lens) / sizeof(int) sounds fine.

Done.
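
For reference, a minimal sketch of the element-count idiom discussed above. The array name and contents are made up for illustration and are not the test's actual byte_lens data. Dividing by the size of the first element, rather than sizeof(int), is one way to keep the count correct if the element type ever changes.

```cpp
#include <cstddef>

// Hypothetical stand-in for the test's byte_lens array; values are illustrative only.
static const int kByteLens[] = {1, 2, 3, 4, 1, 2};

// Element count derived from the array itself instead of a hard-coded literal.
// Using sizeof(kByteLens[0]) avoids repeating the element type.
constexpr std::size_t kNumByteLens = sizeof(kByteLens) / sizeof(kByteLens[0]);

// Template alternative: deduces N from the array type and fails to compile
// if it is handed a pointer instead of an array.
template <typename T, std::size_t N>
constexpr std::size_t ArraySize(const T (&)[N]) {
  return N;
}
```

Either form gives the same count; the template version only adds a compile-time check against accidentally passing a decayed pointer.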


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     int malformed_bytes = last_pos - pos - bytes_len;
> OK, I was thinking that adding too much complexity here would block future
Sounds like a good idea. Please refer to new comments in this method.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
I wonder if we can remove --index here, since index refers to the i-th Unicode 
character in the UTF-8 string, right?

That is to say, we treat illegal UTF-8 bytes as noise and ignore them.
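
To make the suggestion concrete, here is a rough sketch of a forward scan in which index counts only well-formed characters and malformed bytes are skipped without changing it. All names are hypothetical. Utf8SeqLen is a stand-in written for this example and may not match what BitUtil::NumBytesInUTF8Encoding actually returns. The well-formedness check only looks at lead and continuation bit patterns; it does not reject overlong or surrogate encodings.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for BitUtil::NumBytesInUTF8Encoding(); the real helper
// may behave differently. Returns 0 for bytes that cannot start a sequence.
inline int Utf8SeqLen(uint8_t lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;
}

inline bool IsContinuation(uint8_t b) { return (b & 0xC0) == 0x80; }

// Length of the sequence starting at pos, or 0 if it is malformed or truncated.
inline int WellFormedLenAt(const uint8_t* ptr, int len, int pos) {
  int n = Utf8SeqLen(ptr[pos]);
  if (n == 0 || pos + n > len) return 0;
  for (int i = 1; i < n; ++i) {
    if (!IsContinuation(ptr[pos + i])) return 0;
  }
  return n;
}

// Byte offset of the index-th (0-based) well-formed character, or -1 if the
// input holds fewer well-formed characters than that.
inline int FindUtf8PosIgnoringNoise(const uint8_t* ptr, int len, int index) {
  assert(ptr != nullptr || len == 0);
  assert(index >= 0);
  int pos = 0;
  while (pos < len) {
    int n = WellFormedLenAt(ptr, len, pos);
    if (n == 0) {
      ++pos;  // Malformed byte: skip it and leave index untouched.
      continue;
    }
    if (index == 0) return pos;
    --index;
    pos += n;
  }
  return -1;
}
```

With this reading, the stray 0xFF in the bytes 'a', 0xFF, E4 BD A0, 'b' does not consume an index, so index 1 lands on the three-byte character at offset 2.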


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118
PS3, Line 118:     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
Above line 118, we need the following check to guard against the case where a 
prefix of 'ptr' is not valid UTF-8:

if (pos < 0) break;
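
A small sketch of why the guard matters, assuming the backward scan steps over continuation bytes until it reaches a lead byte. The function below is hypothetical and not the code under review. If the buffer starts with continuation bytes, the scan runs off the front and pos becomes -1, so reading ptr[pos] afterwards would be out of bounds.

```cpp
#include <cassert>
#include <cstdint>

inline bool IsUtf8Continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

// Walks backward from last_pos over continuation bytes. Returns the offset of
// the nearest non-continuation byte, or -1 if every byte in [0, last_pos] is a
// continuation byte (that is, the prefix is not valid UTF-8).
inline int FindLeadByteBackward(const uint8_t* ptr, int last_pos) {
  assert(ptr != nullptr);
  int pos = last_pos;
  while (pos >= 0 && IsUtf8Continuation(ptr[pos])) --pos;
  return pos;  // Callers must check pos < 0 before reading ptr[pos].
}
```

A caller would test the result for pos < 0 and stop, which is the role of the break suggested above.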


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120
PS3, Line 120:  if (index < malformed_bytes) {
If index counts only the legal UTF-8 characters, it probably should not be 
compared with malformed_bytes, which is a byte count.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125
PS3, Line 125: index -= malformed_bytes + 1;
Same comment as above.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 23 Jun 2021 12:48:25 +0000
Gerrit-HasComments: Yes
