[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Quanlong Huang (Code Review) Sun, 29 Aug 2021 23:39:35 -0700

Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17780 )


Change subject: IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask 
functions
......................................................................


Patch Set 4:

(10 comments)

Thanks for your review, Qifan! Addressed the comments.

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10
PS3, Line 10: under the UTF-8 mode, i.e., set UTF8_MODE=true. In UT
> nit. I wonder if this is true. Or you mean to say "This patch provides cons
Yeah, done.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114
PS3, Line 114: CheckAndWarnCodePoint
> I wonder if there is a need to check here. Looks like we call Utf8CodePoint
In some places we don't call Utf8CodePointCount(), e.g. in 
MaskShowFirstNImpl(). So we still need this.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
> I think this should be reset to 0 prior to the WHILE loop.
We are collecting code points at index range [start, end - 1) here, so should 
not reset 'char_cnt'.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, 
p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
> Redundant code to code block at line 112?
These are reading new code point from the string. Do you mean we should 
refactor these two-lines codes into a method?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, 
c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
> nit. Maybe we can put the work in this loop in the loop about after line 12
I think we do need to iterate the result code points twice. The first pass is 
used to calculate the result byte length. Then we allocate the result StringVal 
and write bytes into it in the second pass.

We can do sth like what we do in StringFunctions::Replace() that guess the 
result byte length and double/reallocate it if it's not enough. But I think the 
current approach is better that we won't waste any mem space, though a bit 
slower.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159
PS3, Line 159: if
> nit. remove.
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: GetUtf8CodePointCo
> nit. Maybe name it as GetUtf8ByteCount().
Sorry that we are calculating the number of code points, not bytes here. I 
renamed it to GetUtf8CodePointCount().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170
PS3, Line 170: c == utf::illegal ? "illegal" : "i
> This string should be placed as the 2nd argument to produce a better messag
Done


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = GetUtf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
> I wonder if this is intended. Report error immediately when start == -1?
Yes, this is intended and is consistent with Hive. If 'mask_char_count' is 
larger than the number of all characters, we mask all characters.

 0: jdbc:hive2://localhost:11050> select mask_last_n('abcd', 10);
 xxxx


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the 
beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", 
AnyValUtil::ToString(str));
> same comment above by inserting the code point string immediately after "un
Sorry that this is for another purpose. 'str' here is the whole string instead 
of the first code point. This function is used to get the first code point from 
a string, e.g. "abc" => "a".



--
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Comment-Date: Mon, 30 Aug 2021 06:39:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9662,IMPALA-2019(part-3): Support UTF-8 mode in mask functions

Reply via email to