Quanlong Huang created IMPALA-13367:
---------------------------------------

             Summary: Improve performance in counting UTF8 string length
                 Key: IMPALA-13367
                 URL: https://issues.apache.org/jira/browse/IMPALA-13367
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Quanlong Huang


In UTF-8 mode (i.e. set utf8_mode=true), we compute string length by counting
the start bytes of the UTF-8 characters, one byte at a time. In
be/src/exprs/string-functions-ir.cc:
{code:cpp}
static int CountUtf8Chars(uint8_t* ptr, int len) {
  if (ptr == nullptr) return 0;
  int cnt = 0;
  for (int i = 0; i < len; ++i) {
    if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt;
  }
  return cnt;
}{code}
We can leverage SIMD instructions to add a fast path:
 * Check whether all bytes in the string are ASCII characters.
 * If so, return the byte length directly. Otherwise, use the normal code path.

In most cases, strings contain only ASCII characters, so the fast path should
help. Here is an example in Kudu:
https://github.com/apache/kudu/blob/86bdc679f/src/kudu/util/char_util.cc
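
The two steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions, not the Kudu code and not a proposed
patch: IsAllAscii and CountUtf8CharsFast are hypothetical names, the local
IsUtf8StartByte is a stand-in for BitUtil::IsUtf8StartByte (assumed here to
treat every non-continuation byte as a start byte), and SSE2 is used only when
the compiler reports it, with a 64-bit word check as the fallback.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Stand-in for BitUtil::IsUtf8StartByte (assumption): any byte that is not a
// continuation byte (0b10xxxxxx) starts a character.
static inline bool IsUtf8StartByte(uint8_t b) { return (b & 0xC0) != 0x80; }

// Hypothetical helper: true if no byte in [ptr, ptr + len) has its high bit
// set, i.e. every byte is ASCII.
static bool IsAllAscii(const uint8_t* ptr, int len) {
  int i = 0;
#if defined(__SSE2__)
  for (; i + 16 <= len; i += 16) {
    __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + i));
    // movemask gathers the high bit of each of the 16 bytes into an int.
    if (_mm_movemask_epi8(chunk) != 0) return false;
  }
#endif
  // Check 8 bytes at a time (the only wide path when SSE2 is unavailable).
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, ptr + i, sizeof(word));
    if (word & 0x8080808080808080ULL) return false;
  }
  // Remaining tail bytes.
  for (; i < len; ++i) {
    if (ptr[i] & 0x80) return false;
  }
  return true;
}

// Hypothetical variant of CountUtf8Chars with the ASCII fast path in front.
static int CountUtf8CharsFast(const uint8_t* ptr, int len) {
  assert(len >= 0);
  if (ptr == nullptr) return 0;
  // Fast path: in pure ASCII, every byte is exactly one character.
  if (IsAllAscii(ptr, len)) return len;
  // Normal path: count start bytes one at a time.
  int cnt = 0;
  for (int i = 0; i < len; ++i) {
    if (IsUtf8StartByte(ptr[i])) ++cnt;
  }
  return cnt;
}
```

Note that a string containing non-ASCII bytes is scanned twice in this sketch
(partially by the check, then fully by the normal path), so the benefit depends
on ASCII-only strings being the common case.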



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
