[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Qifan Chen (Code Review) Mon, 30 Aug 2021 08:49:33 -0700

Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17780 )


Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask 
functions
......................................................................


Patch Set 4:

(6 comments)

Looks great. My only concern is whether we just want to mask ascii characters, 
or Unicode characters. For the former a simpler algorithm exists for the method 
MastSubStrUtf8().

We may need to decide the scope of the change (ascii characters only, or 
Unicode characters in general) first.

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
> We are collecting code points at index range [start, end - 1) here, so shou
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, 
p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
> These are reading new code point from the string. Do you mean we should ref
Got it. Done.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, 
c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
> I think we do need to iterate the result code points twice. The first pass
Thanks a lot for the explanation.

I think we may not need 'result' of StringVal since in this patch we are not 
masking any non-ascii Unicode characters (see MaskTransform()). All we do is to 
change certain characters (byte for byte). So we could make a copy of source, 
and directly update the substring in question and thus save some CPU cycles and 
extra memory here.

Unicode standard does cover more upper, lower and title letters (see for 
example https://www.compart.com/en/unicode/category/Lu). If we want to pursuit 
the masking of those in this patch, then this method is okay. We do need to 
beef up MaskTransform().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: GetUtf8CodePointCo
> Sorry that we are calculating the number of code points, not bytes here. I
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = GetUtf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
> Yes, this is intended and is consistent with Hive. If 'mask_char_count' is
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the 
beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", 
AnyValUtil::ToString(str));
> Sorry that this is for another purpose. 'str' here is the whole string inst
Done



--
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Mon, 30 Aug 2021 15:49:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Reply via email to