Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 5: (6 comments) Thanks for the feedback! Addressed the comments. http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@150 PS4, Line 150: // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/17580/4/be/src/util/string-util-test.cc@199 PS4, Line 199: // Here we just need 4 characters, i.e. "李小龙 " in byte length 10. Set 'str_len' > line too long (93 > 90) Done http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@102 PS3, Line 102: --index; > I wonder if we can remove --index here, since index is for the ith Unicode Sorry that we need to be consistent with Hive that replaces all illegal UTF8 bytes with U+FFFD (REPLACEMENT CHARACTER). I moved the comments to string-util.h based on the previous comment. Past it here: /// Error handling: /// Malformed UTF8 characters are counted as one byte per character. This is consistent /// with Hive since Hive replaces those bytes to U+FFFD (REPLACEMENT CHARACTER). /// E.g. GenericUDFInstr calls Text#toString(), which performs the replacement. /// TODO(IMPALA-10761): Add query option to control the error handling. We can introduce the ignoring behavior in IMPALA-10761. Do you think it's ok? http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@118 PS3, Line 118: // Get bytes length of the located character. > Above line 118, we need to do the following to guard the case that a prefix Oops! Thanks for catching this! Added test cases to cover this. http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@120 PS3, Line 120: int malformed_bytes = last_pos > If index counts only the legal UTF8 characters, it probably should not be c Replied above http://gerrit.cloudera.org:8080/#/c/17580/3/be/src/util/string-util.cc@125 PS3, Line 125: / We found a legal character > same comment as above. Replied above -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Thu, 24 Jun 2021 07:13:58 +0000 Gerrit-HasComments: Yes
