[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Qifan Chen (Code Review) Thu, 15 Jul 2021 07:59:38 -0700

Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )


Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 9:

(3 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273: if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
nit. Performance-wise, I wonder if we can skip all its #bytes when the first 
byte of a UTF8 char is recognized.

 for (int i = 0; i < len; ) {
    if (BitUtil::IsUtf8StartByte(ptr[i])) {
      ++cnt;
      i += #uTF8bytes(ptr[i])
    } eles {
       i++;
    }


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h
File be/src/util/string-util.h:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.h@79
PS9, Line 79: the last second
nit. second last


http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/util/string-util.cc@116
PS9, Line 116: while (pos >= 0) {
             :     // Point to the start byte of the last character.
             :     while (!BitUtil::IsUtf8StartByte(ptr[pos]) && pos >= 0) 
--pos;
             :     if (pos < 0) {
             :       // Can't find any legal characters. Count each byte from 
last_pos as one character.
             :       // Note that index is 0-based.
             :       if (index < last_pos) return last_pos - index - 1;
             :       return -1;
             :     }
             :     // Get bytes length of the located character.
             :     int bytes_len = BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
             :     // If there are not enough bytes after the first byte, i.e. 
last_pos-pos < bytes_len,
             :     // we consider the bytes belong to a malformed character, 
and count them as one
             :     // character.
             :     int malformed_bytes = max(last_pos - pos - bytes_len, 0);
             :     if (index < malformed_bytes) {
             :       // Count each malformed bytes as one character.
             :       return last_pos - index - 1;
             :     }
             :     // We found a legal character and 'malformed_bytes' 
malformed characters.
             :     // At this point, index >= malformed_bytes. So the lowest 
value of the updated index
             :     // is -1, which means 'pos' points at what we want.
             :     index -= malformed_bytes + 1;
             :     if (index < 0) return pos;
             :     last_pos = pos;
             :     --pos;
             :   }
nit. Compare to FindUtf8PosForward(), this loop looks complex, partially due to 
the first inner loop at line 118 that can over-count # of illegal characters 
beyond the desired number ('index').

I think if we can do something like FindUtf8PosForward(), this function can be 
simplified a little bit.

The BNF grammar for the parsing work is

list of <CHAR>

<CHAR> := <UTF8_CHAR> | <ILLEGAL_CHAR>

So the loop can be modeled as follows.

while (index > 0 && pos >= 0) {
  pos = find_one_CHAR_backwards(pos)-1;
  index--
}

// Find one CHAR in the range [pos-3, pos] and
// return the position of the first byte of the
// character.
int find_one_CHAR_backwards(int pos) {
  int last_pos = pos;
  for (int i=0; i<4; i++) {
    if (BitUtil::IsUtf8StartByte(ptr[pos])) return pos;
    pos--;
    if (pos<0) break;
  }
  // non UTF8 character
  return last_pos;
}



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Thu, 15 Jul 2021 14:59:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Reply via email to