Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc
File be/src/exprs/string-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17580/9/be/src/exprs/string-functions-ir.cc@273
PS9, Line 273:     if (BitUtil::IsUtf8StartByte(ptr[i])) ++cnt
> nit. Performance-wise, I wonder if we can skip all its #bytes when the firs

As we discussed in https://gerrit.cloudera.org/c/17580/5/be/src/exprs/string-functions-ir.cc#269, that approach can't handle malformed characters that have fewer bytes than expected.

BTW, for performance, the current version can use SIMD to speed up, i.e. check the high bit of every byte; if all of them are 0, the string is pure ASCII and we can just return the length (we can do this in another JIRA).


--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Thu, 15 Jul 2021 23:09:57 +0000
Gerrit-HasComments: Yes
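
For illustration, here is a minimal sketch of the two points in the comment on line 273. It is not the Impala code: IsStartByte, IsAllAscii, CountUtf8Chars and CountBySkipping are hypothetical names. IsStartByte only assumes that a start byte is any byte that is not a 10xxxxxx continuation byte, and the ASCII check uses a plain 64-bit word mask as a stand-in for real SIMD.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a start-byte test (assumption): every byte that is
// not a continuation byte (10xxxxxx) starts a character.
inline bool IsStartByte(uint8_t b) { return (b & 0xC0) != 0x80; }

// Returns true if no byte has its high bit set, i.e. the input is pure ASCII.
// Tests 8 bytes per step with a 64-bit mask (word-at-a-time, not real SIMD).
inline bool IsAllAscii(const uint8_t* ptr, size_t len) {
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, ptr + i, sizeof(word));  // safe unaligned load
    if (word & 0x8080808080808080ULL) return false;
  }
  for (; i < len; ++i) {  // remaining tail bytes
    if (ptr[i] & 0x80) return false;
  }
  return true;
}

// Counts characters by counting start bytes, with the ASCII fast path:
// for pure ASCII the character count equals the byte length.
inline size_t CountUtf8Chars(const uint8_t* ptr, size_t len) {
  if (IsAllAscii(ptr, len)) return len;
  size_t cnt = 0;
  for (size_t i = 0; i < len; ++i) {
    if (IsStartByte(ptr[i])) ++cnt;
  }
  return cnt;
}

// The alternative from the quoted nit: skip ahead by the length that the
// leading byte announces. Shown only to illustrate the malformed-input issue.
inline size_t CountBySkipping(const uint8_t* ptr, size_t len) {
  size_t cnt = 0;
  size_t i = 0;
  while (i < len) {
    uint8_t b = ptr[i];
    size_t n = 1;
    if ((b & 0xE0) == 0xC0) n = 2;
    else if ((b & 0xF0) == 0xE0) n = 3;
    else if ((b & 0xF8) == 0xF0) n = 4;
    i += n;  // may jump over bytes that belong to the next character
    ++cnt;
  }
  return cnt;
}
```

With the malformed input E4 BD 'a' 'b' (a 3-byte lead byte with one continuation byte missing), counting start bytes gives 3, while skipping by the lead byte's announced length swallows 'a' and gives 2.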
