[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Qifan Chen (Code Review) Tue, 22 Jun 2021 07:51:05 -0700

Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 2:

(8 comments)

Looks very good!

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/exprs/string-functions-ir.cc@673
PS1, Line 673:     // A positive starting position indicates regular searching 
from the left.
             :     int search_start_pos = start_position.val - 1;
             :     if (utf8_mode) {
             :
> We always need to calculate 'start_position.val - 1' since it's the last ar
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/1/be/src/util/bit-util.h@132
PS1, Line 132: Utf8CodePointLen
> I think I read it as Utf8CodePoint-Len but not Utf8-CodePointLen so don't g
Yeah, it is a little bit trickle. I normally do not consider a particular utf8 
sequence S of bytes, for one unicode character C, as a code point. See 
http://tutorials.jenkov.com/unicode/index.html. See also the encoding section 
in https://en.wikipedia.org/wiki/UTF-8.

The code point of interest, is thus that of C encoded by S.

Would NumBytesInUTF8Encoding() sound better?


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@134
PS2, Line 134: sizeof(byte_lens) / sizeof(int);
nit. A general version would be to find the number of elements in the array.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@136
PS2, Line 136: for (int i = 0; i < total_chars; ++i) {
             :     EXPECT_EQ(pos, FindUtf8PosForward(test_str, total_byte_len, 
i));
             :     pos += byte_lens[i];
             :   }
Good test case!


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@179
PS2, Line 179: chars = sizeof(byte_lens) / sizeof(int);
nit. same comment as above.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util-test.cc@181
PS2, Line 181: for (int i = 0; i < total_chars; ++i) {
             :     pos -= byte_lens[total_chars - i - 1];
             :     EXPECT_EQ(pos, FindUtf8PosBackward(test_str, total_byte_len, 
i));
             :   }
cool


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@99
PS2, Line 99: / Counting malformed UTF8 characters. This is consistent with 
Hive since Hive
            :     // replaces them by default.
nit. May move this to the .h file.


http://gerrit.cloudera.org:8080/#/c/17580/2/be/src/util/string-util.cc@118
PS2, Line 118: / We will get undesired results on malformed UTF8 characters.
             :     // TODO(IMPALA-10761): Add query option to control the error 
handling.
nit. We probably should check illegal utf8 characters here too.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Tue, 22 Jun 2021 14:50:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to