Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 10: (2 comments) http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273 PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt > As we discussed in https://gerrit.cloudera.org/c/17580/5/be/src/exprs/strin See my other comment. In the worst case of malformed characters, we should also make sure the 2nd, 3rd, and 4th byte start with 0x10. http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116 PS9, Line 116: while (pos >= 0) { : // Point to the start byte of the last character. : while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos; : if (pos < 0) { : // Can't find any legal characters. Count each byte from last_pos as one character. : // Note that index is 0-based. : if (index < last_pos) return last_pos - index - 1; : return -1; : } : // Get bytes length of the located character. : int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]); : // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len, : // we consider the bytes belong to a malformed character, and count them as one : // character. : int malformed_bytes = max(last_pos - pos - bytes_len, 0); : if (index < malformed_bytes) { : // Count each malformed bytes as one character. : return last_pos - index - 1; : } : // We found a legal character and 'malformed_bytes' malformed characters. : // At this point, index >= malformed_bytes. So the lowest value of the updated index : // is -1, which means 'pos' points at what we want. : index -= malformed_bytes + 1; : if (index < 0) return pos; : last_pos = pos; : --pos; : } > Yeah, the complexity is due to handling "underflow" and "overflow" malforme >From the comment to this function "/// Error handling: /// Malformed UTF8 characters are counted as one byte per character. This is consistent /// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT CHARACTER)." I have the impression that if in the span of at most 4 bytes, there are no valid utf8 characters, then we count as each byte as a malformed character. In the example of [\xc3 \x83 \x83], yes I think you are right about missing the 2nd \x83. This limitation can be remedied by checking the length of the first recognized utf8 char: if it occupies all the way to 'pos', it is valid utf8; otherwise, ptr[pos] is the invalid one. Also, it looks like we need to check the 2nd, 3rd or 4th byte after seeing a utf8 start byte, if malformed characters can be present and detected. Normally, I think we only need to check malformed characters when a string is from the user/client/risky_source/produced_by_some_SQL_functions and reject the string if so. After that, any utf8 strings can be assumed in utf8 and malformed character checking can be ignored. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Fri, 16 Jul 2021 14:30:45 +0000 Gerrit-HasComments: Yes
