[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Quanlong Huang (Code Review) Fri, 16 Jul 2021 18:35:05 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> See my other comment. In the worst case of malformed characters, we should
I agree that there are lot more we can/should do on malformed characters. But 
they will lead to more inefficient codes, e.g. checking remaining bytes instead 
of just the start byte, which isn't more performant than the current version.

Can we leave this as it is since it's consistent with Hive? I think 
IMPALA-10761 is a better JIRA to solve all these corner cases.


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) 
--pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from 
last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. 
last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, 
and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' 
malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest 
value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> From the comment to this function
Yeah, sorry that the comment is confusing, I think I should change it to

 Bytes that don't belong any UTF8 characters are considered malformed UTF8 
characters as one byte per character. This is consistent with Hive since Hive 
replaces each of those bytes to U+FFFD (REPLACEMENT CHARACTER).

Is it ok for you?

To avoid back-and-forth on dealing with malformed utf8, can we address them in 
IMPALA-10761?



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Sat, 17 Jul 2021 01:34:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to