[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Qifan Chen (Code Review) Fri, 16 Jul 2021 07:31:12 -0700

Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> As we discussed in https://gerrit.cloudera.org/c/17580/5/be/src/exprs/strin
See my other comment. In the worst case of malformed characters, we should also 
make sure the 2nd, 3rd, and 4th byte start with 0x10.


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) 
--pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from 
last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. 
last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, 
and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' 
malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest 
value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
> Yeah, the complexity is due to handling "underflow" and "overflow" malforme
>From the comment to this function
"/// Error handling:
/// Malformed UTF8 characters are counted as one byte per character. This is 
consistent
/// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT 
CHARACTER)."

I have the impression that if in the span of at most 4 bytes, there are no 
valid utf8 characters, then we count as each byte as a malformed character.

In the example of [\xc3 \x83 \x83], yes I think you are right about missing the 
2nd \x83. This limitation can be remedied by checking the length of the first 
recognized utf8 char: if it occupies all the way to 'pos', it is valid utf8; 
otherwise, ptr[pos] is the invalid one.

Also, it looks like we need to check the 2nd, 3rd or 4th byte after seeing a 
utf8 start byte, if malformed characters can be present and detected.

Normally, I think we only need to check malformed characters when a string is 
from the user/client/risky_source/produced_by_some_SQL_functions and reject the 
string if so. After that, any utf8 strings can be assumed in utf8 and malformed 
character checking can be ignored.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 10
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Fri, 16 Jul 2021 14:30:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to