Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17780 )
Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions ...................................................................... Patch Set 4: (6 comments) Looks great. My only concern is whether we just want to mask ascii characters, or Unicode characters. For the former a simpler algorithm exists for the method MastSubStrUtf8(). We may need to decide the scope of the change (ascii characters only, or Unicode characters in general) first. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc File be/src/exprs/mask-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122 PS3, Line 122: char_cnt > We are collecting code points at index range [start, end - 1) here, so shou Done http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123 PS3, Line 123: uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, p_end); : if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null(); > These are reading new code point from the string. Do you mean we should ref Got it. Done. http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142 PS3, Line 142: for (uint32_t c : masked_code_points) { : uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, c, ptr, p_end); : DCHECK(width != utf::illegal && width != utf::incomplete); : ptr += width; : DCHECK(ptr <= p_end); : } > I think we do need to iterate the result code points twice. The first pass Thanks a lot for the explanation. I think we may not need 'result' of StringVal since in this patch we are not masking any non-ascii Unicode characters (see MaskTransform()). All we do is to change certain characters (byte for byte). So we could make a copy of source, and directly update the substring in question and thus save some CPU cycles and extra memory here. Unicode standard does cover more upper, lower and title letters (see for example https://www.compart.com/en/unicode/category/Lu). If we want to pursuit the masking of those in this patch, then this method is okay. We do need to beef up MaskTransform(). http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160 PS3, Line 160: GetUtf8CodePointCo > Sorry that we are calculating the number of code points, not bytes here. I Done http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238 PS3, Line 238: int start = GetUtf8CodePointCount(ctx, val) - mask_char_count; : if (start < 0) start = 0; > Yes, this is intended and is consistent with Hive. If 'mask_char_count' is Done http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388 PS3, Line 388: string msg = Substitute("$0 unicode code point found in the beginning of $1", : c == utf::illegal ? "Illegal" : "Incomplete", AnyValUtil::ToString(str)); > Sorry that this is for another purpose. 'str' here is the whole string inst Done -- To view, visit http://gerrit.cloudera.org:8080/17780 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5 Gerrit-Change-Number: 17780 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Comment-Date: Mon, 30 Aug 2021 15:49:20 +0000 Gerrit-HasComments: Yes
