Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 9: (2 comments) Thank Qifan's suggestion for simplifying the codes. Replied the comments. http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79 PS9, Line 79: the last second > nit. second last Done http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116 PS9, Line 116: while (pos >= 0) { : // Point to the start byte of the last character. : while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos; : if (pos < 0) { : // Can't find any legal characters. Count each byte from last_pos as one character. : // Note that index is 0-based. : if (index < last_pos) return last_pos - index - 1; : return -1; : } : // Get bytes length of the located character. : int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]); : // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len, : // we consider the bytes belong to a malformed character, and count them as one : // character. : int malformed_bytes = max(last_pos - pos - bytes_len, 0); : if (index < malformed_bytes) { : // Count each malformed bytes as one character. : return last_pos - index - 1; : } : // We found a legal character and 'malformed_bytes' malformed characters. : // At this point, index >= malformed_bytes. So the lowest value of the updated index : // is -1, which means 'pos' points at what we want. : index -= malformed_bytes + 1; : if (index < 0) return pos; : last_pos = pos; : --pos; : } > nit. Compare to FindUtf8PosForward(), this loop looks complex, partially du Yeah, the complexity is due to handling "underflow" and "overflow" malformed cases. Here "underflow" means there are less bytes than what the start byte indicates. "overflow" means there are more bytes than what the start byte indicates. I think the above suggested change is simpler because it only handles the "underflow" cases. Let me give an example of the "overflow" case: U+00C3 is the Latin Capital Letter "A" with Tilde. It's encoded into 2 bytes in UTF-8: [\xc3 \x83]. (You can print it by "echo -e '\xc3\x83'" in bash). The byte "\x83" will be a malformed UTF-8 byte if given along. It will be replaced by U+FFFD (REPLACEMENT CHARACTER) in Hive or bash. So the string, [\xc3 \x83 \x83], will be treated as two characters, i.e. [U+00C3, U+FFFD]. Using the above find_one_CHAR_backwards method, we will locate at the start byte \xc3 and ignore the second \x83. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Fri, 16 Jul 2021 01:17:12 +0000 Gerrit-HasComments: Yes
