Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17580 )
Change subject: IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
......................................................................

IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()

Similar to the previous patch, this patch adds UTF-8 support to the
instr() and locate() builtin functions so that they behave consistently
with Hive's. Both string functions take an optional position argument:
  INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
  LOCATE(STRING substr, STRING str[, INT pos])
Their return values are positions of the matched substring. In UTF-8
mode (turned on by set UTF8_MODE=true), these positions are counted in
UTF-8 characters instead of bytes.

Error handling:
Bytes of malformed UTF-8 characters are each counted as one character.
This is consistent with Hive, since Hive replaces those bytes with
U+FFFD (REPLACEMENT CHARACTER). E.g. GenericUDFInstr calls
Text#toString(), which performs the replacement.
We could provide other error-handling behaviors, such as ignoring
malformed bytes or reporting errors. IMPALA-10761 will focus on this.

Tests:
 - Add BE unit tests and e2e tests
 - Add random tests to make sure malformed UTF-8 characters won't crash
   us.
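
To illustrate the counting rule described above, here is a minimal Python sketch. It is not the code from this patch: the names utf8_char_len, char_offsets and locate_utf8 are hypothetical, the validity check is simplified (it does not reject overlong encodings or surrogates), and it only assumes the semantics stated in the commit message, namely that positions are 1-based character positions and that each malformed byte counts as one character.

```python
import bisect


def utf8_char_len(data: bytes, i: int) -> int:
    """Byte length of the character starting at data[i].

    Returns 1 for ASCII and also for any malformed byte, so a bad byte
    is counted as a single character (simplified validity check).
    """
    b = data[i]
    if b < 0x80:
        return 1
    if 0xC0 <= b < 0xE0:
        n = 2
    elif 0xE0 <= b < 0xF0:
        n = 3
    elif 0xF0 <= b < 0xF8:
        n = 4
    else:
        return 1  # stray continuation byte or invalid lead byte
    if i + n > len(data):
        return 1  # truncated sequence
    if any((c & 0xC0) != 0x80 for c in data[i + 1:i + n]):
        return 1  # lead byte not followed by enough continuation bytes
    return n


def char_offsets(data: bytes) -> list:
    """Byte offset at which each character of data starts."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        i += utf8_char_len(data, i)
    return offsets


def locate_utf8(substr: bytes, s: bytes, pos: int = 1) -> int:
    """1-based character position of substr in s, searching from
    character position pos; 0 if there is no match."""
    offsets = char_offsets(s)
    if pos < 1 or pos > len(offsets):
        return 0
    byte_idx = s.find(substr, offsets[pos - 1])
    if byte_idx < 0:
        return 0
    # Number of characters that start at or before the matching byte.
    return bisect.bisect_right(offsets, byte_idx)
```

For example, with s = "你好世界".encode("utf-8"), locate_utf8("世".encode("utf-8"), s) gives 3, whereas the byte-based position s.find(...) + 1 is 7. With a malformed input such as b"\xffab", the stray 0xff byte counts as one character, so locate_utf8(b"b", b"\xffab") gives 3.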

Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/exprs/expr-test.cc
M be/src/exprs/string-functions-ir.cc
M be/src/util/CMakeLists.txt
M be/src/util/bit-util.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M testdata/workloads/functional-query/queries/QueryTest/utf8-string-functions.test
8 files changed, 410 insertions(+), 27 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/17580
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Gerrit-Change-Number: 17580
Gerrit-PatchSet: 15
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
