Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17580 )

Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and 
locate()
......................................................................


Patch Set 8:

(3 comments)

> Patch Set 7:
>
> My only expectation about malformed strings is that we shouldn't crash/hang. 
> It would be great to return the same results as Hive, but this may turn out 
> to be complex/inherently slow.
>
> A way to catch unlucky cases that lead to crash would be to add a test that 
> passes random strings (with malformed bytes included).

Agreed. Added more tests including random test.

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/exprs/string-functions-ir.cc@269
PS5, Line 269: CountUtf8Chars
> This can return different values compared to the original implementation of
Good question. I did an experiment in Hive and it shows up results of our old 
behavior. I'll rollback this change due to this. CC @Qifan.


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc
File be/src/util/string-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util-test.cc@155
PS5, Line 155:   EXPECT_EQ(10, FindUtf8PosForward("李小龙 \x93\x93 ", 10, 5));
> Can you add a test for the "all malformed" case here and for FindUtf8PosBac
Done


http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/17580/5/be/src/util/string-util.cc@105
PS5, Line 105:     pos += BitUtil::NumBytesInUTF8Encoding(ptr[pos]);
> Nice catching! One wrong length will mess up the whole processing since we
Done in PS8. If there are less bytes than expected, the next character will be 
messed up with this malformed one, and being counted as one character with 
this. It would be inconsistent with Hive but I think it's ok for now. I just 
add a check at the end to deal with pos > len.



--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Thu, 15 Jul 2021 08:53:44 +0000
Gerrit-HasComments: Yes

Reply via email to