Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate() ...................................................................... Patch Set 8: (3 comments) > Patch Set 7: > > My only expectation about malformed strings is that we shouldn't crash/hang. > It would be great to return the same results as Hive, but this may turn out > to be complex/inherently slow. > > A way to catch unlucky cases that lead to crash would be to add a test that > passes random strings (with malformed bytes included). Agreed. Added more tests including random test. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc File be/src/exprs/string-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc@269 PS5, Line 269: CountUtf8Chars > This can return different values compared to the original implementation of Good question. I did an experiment in Hive and it shows up results of our old behavior. I'll rollback this change due to this. CC @Qifan. http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc File be/src/util/string-util-test.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155 PS5, Line 155: EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 5)); > Can you add a test for the "all malformed" case here and for FindUtf8PosBac Done http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc File be/src/util/string-util.cc: http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105 PS5, Line 105: pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]); > Nice catching! One wrong length will mess up the whole processing since we Done in PS8. If there are less bytes than expected, the next character will be messed up with this malformed one, and being counted as one character with this. It would be inconsistent with Hive but I think it's ok for now. I just add a check at the end to deal with pos > len. -- To view, visit http://gerrit.cloudera.org:8080/17580 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662 Gerrit-Change-Number: 17580 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Thu, 15 Jul 2021 08:53:44 +0000 Gerrit-HasComments: Yes
