Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 10: (2 comments) http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273 PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt > See my other comment. In the worst case of malformed characters, we should I agree that there are lot more we can/should do on malformed characters. But they will lead to more inefficient codes, e.g. checking remaining bytes instead of just the start byte, which isn't more performant than the current version. Can we leave this as it is since it's consistent with Hive? I think IMPALA-10761 is a better JIRA to solve all these corner cases. http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116 PS9, Line 116: while (pos >= 0) { : // Point to the start byte of the last character. : while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos; : if (pos < 0) { : // Can't find any legal characters. Count each byte from last_pos as one character. : // Note that index is 0-based. : if (index < last_pos) return last_pos - index - 1; : return -1; : } : // Get bytes length of the located character. : int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]); : // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len, : // we consider the bytes belong to a malformed character, and count them as one : // character. : int malformed_bytes = max(last_pos - pos - bytes_len, 0); : if (index < malformed_bytes) { : // Count each malformed bytes as one character. : return last_pos - index - 1; : } : // We found a legal character and 'malformed_bytes' malformed characters. : // At this point, index >= malformed_bytes. So the lowest value of the updated index : // is -1, which means 'pos' points at what we want. : index -= malformed_bytes + 1; : if (index < 0) return pos; : last_pos = pos; : --pos; : } > From the comment to this function Yeah, sorry that the comment is confusing, I think I should change it to Bytes that don't belong any UTF8 characters are considered malformed UTF8 characters as one byte per character. This is consistent with Hive since Hive replaces each of those bytes to U+FFFD (REPLACEMENT CHARACTER). Is it ok for you? To avoid back-and-forth on dealing with malformed utf8, can we address them in IMPALA-10761? -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 10 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Sat, 17 Jul 2021 01:34:54 +0000 Gerrit-HasComments: Yes
