Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 5: (4 comments) http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155 PS5, Line 155: EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 6)); Can you add a test for the "all malformed" case here and for FindUtf8PosBackward too? http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115 PS5, Line 115: while (pos >= 0 && index >= 0) { The loop seems correct to me, bit it was a bit hard for me to understand it. Some optional ideas to make it easier to read: - replace the "break"s with "return"s. In this case the final return return could be return -1; + a DCHECK_EQ(pos, -1) could be added - at line 119 add a separate "return" block for the pos<0 case http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115 PS5, Line 115: index >= 0 Is index >= 0 needed? We assume it to be >= 0 at the start and break the loop at line 127 if it becomes negative. If the intention is to defend from incorrect input in release mode, then I would prefer to return -1 if it index is negative. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120 PS5, Line 120: int malformed_bytes = last_pos - pos - bytes_len; How should we handle the case when there is a valid utf8 start byte, but not enough characters after it since last_pos? This would have the weird effect of increasing index at line 127. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 12 Jul 2021 12:27:19 +0000 Gerrit-HasComments: Yes
