Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )
Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions ...................................................................... Patch Set 4: (10 comments) Thanks for your review, Qifan! Addressed the comments. http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10 PS3, Line 10: under the UTF-8 mode, i.e., set UTF8_MODE=true. In UT > nit. I wonder if this is true. Or you mean to say "This patch provides cons Yeah, done. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc File be/src/exprs/mask-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114 PS3, Line 114: CheckAndWarnCodePoint > I wonder if there is a need to check here. Looks like we call Utf8CodePoint In some places we don't call Utf8CodePointCount(), e.g. in MaskShowFirstNImpl(). So we still need this. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122 PS3, Line 122: char_cnt > I think this should be reset to 0 prior to the WHILE loop. We are collecting code points at index range [start, end - 1) here, so should not reset 'char_cnt'. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123 PS3, Line 123: uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end); : if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null(); > Redundant code to code block at line 112? These are reading new code point from the string. Do you mean we should refactor these two-lines codes into a method? http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142 PS3, Line 142: for (uint32_t c : masked_code_points) { : uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end); : DCHECK(width != utf::illegal && width != utf::incomplete); : ptr += width; : DCHECK(ptr <= p_end); : } > nit. Maybe we can put the work in this loop in the loop about after line 12 I think we do need to iterate the result code points twice. The first pass is used to calculate the result byte length. Then we allocate the result StringVal and write bytes into it in the second pass. We can do sth like what we do in StringFunctions::Replace() that guess the result byte length and double/reallocate it if it's not enough. But I think the current approach is better that we won't waste any mem space, though a bit slower. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159 PS3, Line 159: if > nit. remove. Done http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160 PS3, Line 160: GetUtf8CodePointCo > nit. Maybe name it as GetUtf8ByteCount(). Sorry that we are calculating the number of code points, not bytes here. I renamed it to GetUtf8CodePointCount(). http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170 PS3, Line 170: c == utf::illegal ? "illegal" : "i > This string should be placed as the 2nd argument to produce a better messag Done http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238 PS3, Line 238: int start = GetUtf8CodePointCount(ctx, val) - mask_char_count; : if (start < 0) start = 0; > I wonder if this is intended. Report error immediately when start == -1? Yes, this is intended and is consistent with Hive. If 'mask_char_count' is larger than the number of all characters, we mask all characters. 0: jdbc:hive2://localhost:11050> select mask_last_n('abcd', 10); xxxx http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388 PS3, Line 388: string msg = Substitute("$0 unicode code point found in the beginning of $1", : c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str)); > same comment above by inserting the code point string immediately after "un Sorry that this is for another purpose. 'str' here is the whole string instead of the first code point. This function is used to get the first code point from a string, e.g. "abc" => "a". -- To view, visit http://gerrit.cloudera.org:8080/17780 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5 Gerrit-Change-Number: 17780 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 30 Aug 2021 06:39:27 +0000 Gerrit-HasComments: Yes
