Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 7: (4 comments) Thank Csaba's feedback and pointing out other corner cases of malformed UTF8 characters! I'm still modifying the patch but would like to post an update in time. TBO, this patch got more and more complex due to dealing with malformed UTF8 characters. At first, I tend to not doing any error handling and just mentioning that illegal UTF8 characters will have undesired results. Take other systems as examples, * in Presto, there are no explicit checks for valid UTF-8 and the functions may return incorrect results on invalid UTF-8: https://prestodb.io/docs/0.257/functions/string.html#string-functions * in ClickHouse, the UTF8 functions also assume that the string contains a set of bytes that make up UTF-8 encoded text. https://clickhouse.tech/docs/en/sql-reference/functions/string-functions/ They both provide special functions for dealing with invalid UTF8, e.g. from_utf8() in Presto and isValidUTF8()/toValidUTF8() in ClickHouse. I think the reason for this approach is for efficiency. Assuming valid UTF-8 characters helps to vectorise the implementation by SIMD instructions. However, Hive implicitly converts malformed UTF-8 characters. To be consistent with Hive, it seems we need to do the same at many places, which will results in inefficient and complex codes(branches). What do you think? Does it make sense if we choose the same behavior like Presto and ClickHouse that assume wrong results on invalid UTF-8 characters, and leave the consistent work done in IMPALA-10761? http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105 PS5, Line 105: pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]); > I realized that issue mentioned in line 120 also exists here: if NumBytesIn Nice catching! One wrong length will mess up the whole processing since we can't jump to the next start byte correctly. I'm still thinking how to fix this in an efficient way. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115 PS5, Line 115: while (pos >= 0) { > The loop seems correct to me, bit it was a bit hard for me to understand it Good points! Done. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@115 PS5, Line 115: > Is index >= 0 needed? We assume it to be >= 0 at the start and break the lo Sure. It's added at the first PS that we don't deal with illegal cases. We can remove it now. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@120 PS5, Line 120: // Note that index is 0-based. > How should we handle the case when there is a valid utf8 start byte, but no Nice catching! So in this case, malformed_bytes < 0. We can handle it as 0, so the bytes of [pos, last_pos) are considered as one character. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Wed, 14 Jul 2021 14:23:27 +0000 Gerrit-HasComments: Yes
