Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 9: (3 comments) Looks great! http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273 PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt nit. Performance-wise, I wonder if we can skip all its #bytes when the first byte of a UTF8 char is recognized. for (int i = 0; i < len; ) { if (BitUtil::IsUtf8StartByte(ptr[i])) { ++cnt; i += #uTF8bytes(ptr[i]) } eles { i++; } http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h File be/src/util/string-util.h: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79 PS9, Line 79: the last second nit. second last http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116 PS9, Line 116: while (pos >= 0) { : // Point to the start byte of the last character. : while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) --pos; : if (pos < 0) { : // Can't find any legal characters. Count each byte from last_pos as one character. : // Note that index is 0-based. : if (index < last_pos) return last_pos - index - 1; : return -1; : } : // Get bytes length of the located character. : int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]); : // If there are not enough bytes after the first byte, i.e. last_pos-pos < bytes_len, : // we consider the bytes belong to a malformed character, and count them as one : // character. : int malformed_bytes = max(last_pos - pos - bytes_len, 0); : if (index < malformed_bytes) { : // Count each malformed bytes as one character. : return last_pos - index - 1; : } : // We found a legal character and 'malformed_bytes' malformed characters. : // At this point, index >= malformed_bytes. So the lowest value of the updated index : // is -1, which means 'pos' points at what we want. : index -= malformed_bytes + 1; : if (index < 0) return pos; : last_pos = pos; : --pos; : } nit. Compare to FindUtf8PosForward(), this loop looks complex, partially due to the first inner loop at line 118 that can over-count # of illegal characters beyond the desired number ('index'). I think if we can do something like FindUtf8PosForward(), this function can be simplified a little bit. The BNF grammar for the parsing work is list of <CHAR> <CHAR> := <UTF8_CHAR> | <ILLEGAL_CHAR> So the loop can be modeled as follows. while (index > 0 && pos >= 0) { pos = find_one_CHAR_backwards(pos)-1; index-- } // Find one CHAR in the range [pos-3, pos] and // return the position of the first byte of the // character. int find_one_CHAR_backwards(int pos) { int last_pos = pos; for (int i=0; i<4; i++) { if (BitUtil::IsUtf8StartByte(ptr[pos])) return pos; pos--; if (pos<0) break; } // non UTF8 character return last_pos; } -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 9 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Thu, 15 Jul 2021 14:59:04 +0000 Gerrit-HasComments: Yes
