Quanlong Huang created IMPALA-13367:
---------------------------------------
Summary: Improve performance in counting UTF8 string length
Key: IMPALA-13367
URL: https://issues.apache.org/jira/browse/IMPALA-13367
Project: IMPALA
Issue Type: Improvement
Components: Backend
Reporter: Quanlong Huang
In UTF-8 mode (i.e. set utf8_mode=true), we compute string length by counting
the start bytes of UTF-8 characters. In be/src/exprs/string-functions-ir.cc:
{code:cpp}
static int CountUtf8Chars(uint8_t* ptr, int len) {
  if (ptr == nullptr) return 0;
  int cnt = 0;
  for (int i = 0; i < len; ++i) {
    if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
  }
  return cnt;
}{code}
We can leverage SIMD instructions to improve this:
* Check whether all bytes in the string are ASCII characters.
* If so, return the length directly. Otherwise, use the normal code path.
In most cases, strings contain only ASCII characters, so this fast path helps.
Here is an example in Kudu:
https://github.com/apache/kudu/blob/86bdc679f/src/kudu/util/char_util.cc
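A rough sketch of the proposed fast path, to make the idea concrete. This is not the Kudu code and not a proposed patch. The names IsAllAsciiSketch, CountUtf8CharsSketch and IsUtf8StartByteSketch are hypothetical, and the start-byte test is an assumed stand-in for BitUtil::IsUtf8StartByte (any byte that is not a 10xxxxxx continuation byte). It assumes SSE2 is available and falls back to 8-byte words and then single bytes otherwise:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Assumed stand-in for BitUtil::IsUtf8StartByte: every byte except a
// continuation byte (bit pattern 10xxxxxx) starts a character.
static inline bool IsUtf8StartByteSketch(uint8_t b) {
  return (b & 0xC0) != 0x80;
}

// Hypothetical helper: returns true if no byte in [ptr, ptr + len) has its
// high bit set, i.e. the whole buffer is 7-bit ASCII.
static bool IsAllAsciiSketch(const uint8_t* ptr, int len) {
  int i = 0;
#if defined(__SSE2__)
  // 16 bytes at a time: movemask gathers the high bit of each byte.
  for (; i + 16 <= len; i += 16) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + i));
    if (_mm_movemask_epi8(chunk) != 0) return false;
  }
#endif
  // 8 bytes at a time; memcpy avoids unaligned-access problems.
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, ptr + i, sizeof(word));
    if (word & 0x8080808080808080ULL) return false;
  }
  // Remaining tail bytes.
  for (; i < len; ++i) {
    if (ptr[i] & 0x80) return false;
  }
  return true;
}

// Same contract as CountUtf8Chars above, with the ASCII fast path in front.
static int CountUtf8CharsSketch(const uint8_t* ptr, int len) {
  if (ptr == nullptr || len <= 0) return 0;
  // All ASCII: every byte is one character, so the byte length is the answer.
  if (IsAllAsciiSketch(ptr, len)) return len;
  // Otherwise use the existing byte-by-byte path.
  int cnt = 0;
  for (int i = 0; i < len; ++i) {
    if (IsUtf8StartByteSketch(ptr[i])) ++cnt;
  }
  return cnt;
}
```

For non-ASCII input this costs one extra partial scan before the normal path, since the check stops at the first byte with the high bit set. Whether that trade-off pays off would need to be confirmed with a benchmark.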