Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 5:

(6 comments)

Thanks for the feedback! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@150
PS4, Line 150:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 
10. Set 'str_len'
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@199
PS4, Line 199:   // Here we just need 4 characters, i.e. "李小龙 " in byte length 
10. Set 'str_len'
> line too long (93 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102
PS3, Line 102: --index;
> I wonder if we can remove --index here, since index is for the ith Unicode
Sorry that we need to be consistent with Hive that replaces all illegal UTF8 
bytes with U+FFFD (REPLACEMENT CHARACTER). I moved the comments to 
string-util.h based on the previous comment. Past it here:

/// Error handling:
/// Malformed UTF8 characters are counted as one byte per character. This is 
consistent
/// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT CHARACTER).
/// E.g. GenericUDFInstr calls Text#toString(), which performs the replacement.
/// TODO(IMPALA-10761): Add query option to control the error handling.

We can introduce the ignoring behavior in IMPALA-10761. Do you think it's ok?


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118
PS3, Line 118:     // Get bytes length of the located character.
> Above line 118, we need to do the following to guard the case that a prefix
Oops! Thanks for catching this! Added test cases to cover this.


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120
PS3, Line 120:  int malformed_bytes = last_pos
> If index counts only the legal UTF8 characters, it probably should not be c
Replied above


http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125
PS3, Line 125: / We found a legal character
> same comment as above.
Replied above



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Thu, 24 Jun 2021 07:13:58 +0000
Gerrit-HasComments: Yes

Reply via email to