Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17780 )

Change subject: IMPALA-9662,IMPALA-2019(part-4): Support UTF-8 mode in mask 
functions
......................................................................


Patch Set 3:

(10 comments)

Looks good to me!

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17780/3//COMMIT_MSG@10
PS3, Line 10: by turning on the UTF-8 mode, i.e. set UTF8_MODE=true
nit. I wonder if this is true. Or you mean to say "This patch provides 
consistent masking behavior with Hive for
strings under the UTF-8 mode, i.e., set UTF8_MODE=true.".


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc
File be/src/exprs/mask-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@114
PS3, Line 114: CheckAndWarnCodePoint
I wonder if there is a need to check here. Looks like we call 
Utf8CodePointCount() in the first place. If we reach here, the string is good.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@122
PS3, Line 122: char_cnt
I think this should be reset to 0 prior to the WHILE loop.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@123
PS3, Line 123:  uint32_t c = utf8_codecvt<char>::to_unicode(cvt_state, p, 
p_end);
             :     if (CheckAndWarnCodePoint(ctx, c)) return StringVal::null();
Redundant code to code block at line 112?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@142
PS3, Line 142: for (uint32_t c : masked_code_points) {
             :     uint32_t width = utf8_codecvt<char>::from_unicode(cvt_state, 
c, ptr, p_end);
             :     DCHECK(width != utf::illegal && width != utf::incomplete);
             :     ptr += width;
             :     DCHECK(ptr <= p_end);
             :   }
nit. Maybe we can put the work in this loop in the loop about after line 126. 
So masked_code_points can become converted_code_points. Furthermore, 
converted_code_points can be declared as std::string() and we append converted 
code points to it.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@159
PS3, Line 159: for
nit. remove.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@160
PS3, Line 160: Utf8CodePointCount
nit. Maybe name it as GetUtf8ByteCount().


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@170
PS3, Line 170: AnyValUtil::ToString(val)).c_str()
This string should be placed as the 2nd argument to produce a better message. 
For example,

The 1-th code point <xxxx> is illegal.


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@238
PS3, Line 238:  int start = Utf8CodePointCount(ctx, val) - mask_char_count;
             :   if (start < 0) start = 0;
I wonder if this is intended. Report error immediately when start == -1?


http://gerrit.cloudera.org:8080/#/c/17780/3/be/src/exprs/mask-functions-ir.cc@388
PS3, Line 388:  string msg = Substitute("$0 unicode code point found in the 
beginning of $1",
             :         c == utf::illegal ? "Illegal" : "Incomplete", 
AnyValUtil::ToString(str));
same comment above by inserting the code point string immediately after 
"unicode code point".



--
To view, visit http://gerrit.cloudera.org:8080/17780
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1276eccc94c9528507349b155a51e76f338367d5
Gerrit-Change-Number: 17780
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Comment-Date: Mon, 23 Aug 2021 16:06:27 +0000
Gerrit-HasComments: Yes

Reply via email to