Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 3:

(6 comments)

Looks very good! Thanks a lot.

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
> Sorry, do you mean using 24 directly?
Never mind. sizeof(byte_lens) / sizeof(int) sounds fine. Done.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             : int malformed_bytes = last_pos - pos - bytes_len;
> OK, I was thinking that adding too much complexity here would block future
Sounds like a good idea. Please refer to new comments in this method.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
I wonder if we can remove --index here, since index is for the ith Unicode character in UTF8, right? That is to say, we treat illegal UTF8 characters as noise and ignore them.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118
PS3, Line 118: int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
Above line 118, we need to do the following to guard the case that a prefix of 'ptr' is not in UTF8.

  if (pos < 0) break;


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120
PS3, Line 120: if (index < malformed_bytes) {
If index counts only the legal UTF8 characters, it probably should not be compared with malformed_bytes.
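To make the suggestion on lines 102 to 120 concrete, here is a minimal sketch of a backward scan in which index counts only well-formed characters and malformed bytes are skipped as noise. It is not the code in this patch: FindUtf8PosBackwardSketch, LeadByteLen and IsContinuation are hypothetical names, LeadByteLen is only a stand-in for BitUtil::NumBytesInUTF8Encoding, and the sketch checks sequence lengths only (no overlong or surrogate validation).

```cpp
// Hypothetical stand-in for a lead-byte length lookup. Returns 0 for bytes
// that cannot start a UTF-8 character (continuation or invalid bytes).
inline int LeadByteLen(unsigned char b) {
  if (b < 0x80) return 1;            // 0xxxxxxx
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// True for bytes of the form 10xxxxxx.
inline bool IsContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Returns the byte offset of the index-th (0-based) well-formed character,
// counting from the end of ptr[0, len), or -1 if there are not enough
// well-formed characters. Malformed bytes never change index.
inline int FindUtf8PosBackwardSketch(const char* ptr, int len, int index) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(ptr);
  int last_pos = len;  // exclusive end of the part not yet scanned
  while (last_pos > 0) {
    int pos = last_pos - 1;
    // Walk back over continuation bytes to the nearest candidate lead byte.
    while (pos >= 0 && IsContinuation(p[pos])) --pos;
    // Guard: the remaining prefix holds only continuation bytes.
    if (pos < 0) break;
    int bytes_len = LeadByteLen(p[pos]);
    int available = last_pos - pos;
    if (bytes_len > 0 && bytes_len <= available) {
      // A complete character starts at pos. Any extra continuation bytes
      // after it are treated as noise and ignored.
      if (index == 0) return pos;
      --index;
    }
    // Otherwise the bytes in [pos, last_pos) are malformed: skip them
    // without touching index.
    last_pos = pos;
  }
  return -1;
}
```

With this shape there is no comparison between index and malformed_bytes at all. For the bytes 0x80 0x80 'a' 0xFF 'b', index 1 yields offset 2 (the 'a'), and index 2 yields -1 because the pos < 0 guard fires on the two leading continuation bytes.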
http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125
PS3, Line 125: index -= malformed_bytes + 1;
Same comment as on line 120: if index counts only the legal UTF8 characters, it should not be adjusted by malformed_bytes.


--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 23 Jun 2021 12:48:25 +0000
Gerrit-HasComments: Yes
