Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 2: (8 comments) Looks very good! http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673 PS1, Line 673: // A positive starting position indicates regular searching from the left. : int search_start_pos = start_position.val - 1; : if (utf8_mode) { : > We always need to calculate 'start_position.val - 1' since it's the last ar Good point. Done. http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h File be/src/util/bit-util.h: http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132 PS1, Line 132: Utf8CodePointLen > I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't g Yeah, it is a little bit trickle. I normally do not consider a particular utf8 sequence S of bytes, for one unicode character C, as a code point. See http://tutorials.jenkov.com/unicode/index.html. See also the encoding section in https://en.wikipedia.org/wiki/UTF-8. The code point of interest, is thus that of C encoded by S. Would NumBytesInUTF8Encoding() sound better? http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134 PS2, Line 134: sizeof(byte_lens) / sizeof(int); nit. A general version would be to find the number of elements in the array. http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@136 PS2, Line 136: for (int i = 0; i < total_chars; ++i) { : EXPECT_EQ(pos, FindUtf8PosForward(test_str, total_byte_len, i)); : pos += byte_lens[i]; : } Good test case! http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@179 PS2, Line 179: chars = sizeof(byte_lens) / sizeof(int); nit. same comment as above. http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@181 PS2, Line 181: for (int i = 0; i < total_chars; ++i) { : pos -= byte_lens[total_chars - i - 1]; : EXPECT_EQ(pos, FindUtf8PosBackward(test_str, total_byte_len, i)); : } cool http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@99 PS2, Line 99: / Counting malformed UTF8 characters. This is consistent with Hive since Hive : // replaces them by default. nit. May move this to the .h file. http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118 PS2, Line 118: / We will get undesired results on malformed UTF8 characters. : // TODO(IMPALA-10761): Add query option to control the error handling. nit. We probably should check illegal utf8 characters here too. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Tue, 22 Jun 2021 14:50:52 +0000 Gerrit-HasComments: Yes
