[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Quanlong Huang (Code Review) Thu, 15 Jul 2021 18:17:20 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 9:

(2 comments)

Thank Qifan's suggestion for simplifying the codes. Replied the comments.

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79
PS9, Line 79: the last second
> nit. second last
Done


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) 
--pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from 
last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. 
last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, 
and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' 
malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest 
value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> nit. Compare to FindUtf8PosForward(), this loop looks complex, partially du
Yeah, the complexity is due to handling "underflow" and "overflow" malformed 
cases. Here "underflow" means there are less bytes than what the start byte 
indicates. "overflow" means there are more bytes than what the start byte 
indicates.

I think the above suggested change is simpler because it only handles the 
"underflow" cases. Let me give an example of the "overflow" case:

U+00C3 is the Latin Capital Letter "A" with Tilde. It's encoded into 2 bytes in 
UTF-8: [\xc3 \x83]. (You can print it by "echo -e '\xc3\x83'" in bash). The 
byte "\x83" will be a malformed UTF-8 byte if given along. It will be replaced 
by U+FFFD
(REPLACEMENT CHARACTER) in Hive or bash.

So the string, [\xc3 \x83 \x83], will be treated as two characters, i.e. 
[U+00C3, U+FFFD]. Using the above find_one_CHAR_backwards method, we will 
locate at the start byte \xc3 and ignore the second \x83.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Fri, 16 Jul 2021 01:17:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to