[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Quanlong Huang (Code Review) Wed, 14 Jul 2021 07:23:36 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 7:

(4 comments)

Thank Csaba's feedback and pointing out other corner cases of malformed UTF8 
characters! I'm still modifying the patch but would like to post an update in 
time.

TBO, this patch got more and more complex due to dealing with malformed UTF8 
characters. At first, I tend to not doing any error handling and just 
mentioning that illegal UTF8 characters will have undesired results. Take other 
systems as examples,
* in Presto, there are no explicit checks for valid UTF-8 and the functions may 
return incorrect results on invalid UTF-8: 
https://prestodb.io/docs/0.257/functions/string.html#string-functions
* in ClickHouse, the UTF8 functions also assume that the string contains a set 
of bytes that make up UTF-8 encoded text. 
https://clickhouse.tech/docs/en/sql-reference/functions/string-functions/

They both provide special functions for dealing with invalid UTF8, e.g. 
from_utf8() in Presto and isValidUTF8()/toValidUTF8() in ClickHouse. I think 
the reason for this approach is for efficiency. Assuming valid UTF-8 characters 
helps to vectorise the implementation by SIMD instructions.

However, Hive implicitly converts malformed UTF-8 characters. To be consistent 
with Hive, it seems we need to do the same at many places, which will results 
in inefficient and complex codes(branches). What do you think? Does it make 
sense if we choose the same behavior like Presto and ClickHouse that assume 
wrong results on invalid UTF-8 characters, and leave the consistent work done 
in IMPALA-10761?

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
> I realized that issue mentioned in line 120 also exists here: if NumBytesIn
Nice catching! One wrong length will mess up the whole processing since we 
can't jump to the next start byte correctly. I'm still thinking how to fix this 
in an efficient way.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115:   while (pos >= 0) {
> The loop seems correct to me, bit it was a bit hard for me to understand it
Good points! Done.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115
PS5, Line 115:
> Is index >= 0 needed? We assume it to be >= 0 at the start and break the lo
Sure. It's added at the first PS that we don't deal with illegal cases. We can 
remove it now.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120
PS5, Line 120:       // Note that index is 0-based.
> How should we handle the case when there is a valid utf8 start byte, but no
Nice catching! So in this case, malformed_bytes < 0. We can handle it as 0, so 
the bytes of [pos, last_pos) are considered as one character.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Wed, 14 Jul 2021 14:23:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to