Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )
Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask functions ...................................................................... Patch Set 3: (10 comments) Looks good to me! http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10 PS3, Line 10: by turning on the UTF-8 mode, i.e. set UTF8_MODE=true nit. I wonder if this is true. Or you mean to say "This patch provides consistent masking behavior with Hive for strings under the UTF-8 mode, i.e., set UTF8_MODE=true.". http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc File be/src/exprs/mask-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114 PS3, Line 114: CheckAndWarnCodePoint I wonder if there is a need to check here. Looks like we call Utf8CodePointCount() in the first place. If we reach here, the string is good. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122 PS3, Line 122: char_cnt I think this should be reset to 0 prior to the WHILE loop. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123 PS3, Line 123: uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end); : if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null(); Redundant code to code block at line 112? http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142 PS3, Line 142: for (uint32_t c : masked_code_points) { : uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end); : DCHECK(width != utf::illegal && width != utf::incomplete); : ptr += width; : DCHECK(ptr <= p_end); : } nit. Maybe we can put the work in this loop in the loop about after line 126. So masked_code_points can become converted_code_points. Furthermore, converted_code_points can be declared as std::string() and we append converted code points to it. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159 PS3, Line 159: for nit. remove. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160 PS3, Line 160: Utf8CodePointCount nit. Maybe name it as GetUtf8ByteCount(). http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170 PS3, Line 170: AnyValUtil::ToString(val)).c_str() This string should be placed as the 2nd argument to produce a better message. For example, The 1-th code point <xxxx> is illegal. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238 PS3, Line 238: int start = Utf8CodePointCount(ctx, val) - mask_char_count; : if (start < 0) start = 0; I wonder if this is intended. Report error immediately when start == -1? http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388 PS3, Line 388: string msg = Substitute("$0 unicode code point found in the beginning of $1", : c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str)); same comment above by inserting the code point string immediately after "unicode code point". -- To view, visit http://gerrit.cloudera.org:8080/17780 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5 Gerrit-Change-Number: 17780 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Comment-Date: Mon, 23 Aug 2021 16:06:27 +0000 Gerrit-HasComments: Yes
