[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

Tim Armstrong (Code Review) Fri, 12 Feb 2021 14:08:53 -0800

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )


Change subject: IMPALA-5675: Support UTF-8 Varchar and Char
......................................................................


Patch Set 13:

(5 comments)

I looked at the backend change and the approach seems sane.

I think an alternative for CHAR would be to switch to storing it in as a 
variable-length slot, i.e. StringValue, but to retain the padding logic. The 
approach you choose probably requires fewer code changes, which is good for a 
legacy data type that is hopefully rarely used.

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7
PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char
One minor thing : I think we should comment the len field, e.g. in StringValue 
StringVal, ColumnType, TypeDescriptor to be clear whether it means the length 
in bytes or the length in characters.


http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45
PS13, Line 45:
In the backend, how did you identify all the places you needed to change the 
code? I can think of a number of places where the behaviour depends on the 
CHAR/VARCHAR length

* Cast functions to CHAR/VARCHAR
* Cast functions from STRING to CHAR, where we need to identify the "true" 
length of the CHAR. Although if we switched to the truncating behaviour 
suggested in IMPALA-1652, that might be the same for both code paths.
* Scanners where we read and may need to pad or truncate according to the data 
type.
* Expression evaluation where we turn a CHAR slot into a StringVal and a 
StringVal into a CHAR slot
* Both the codegen'd and interpreted codepaths of the above.

Generally I think my concern is that we've caught all the places and that we're 
testing the code.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392
PS13, Line 392:     if (!state_->utf8_mode()) {
Should we add a DCHECK in UTF-8 mode to verify that the strings returned from 
Kudu do not have > the required number of characters.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc
File be/src/exprs/cast-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218
PS13, Line 218:   // TODO: Will timestamp strings have non-ASCII characters?
I don't think they could? Custom timestamp formats could have them, but that 
doesn't apply here- you can't have a custom format in a cast.


http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc
File be/src/exprs/scalar-expr-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316
PS13, Line 316:       // TODO: consider returning reference of the StringVal in 
UTF-8 mode.
You might have to be careful here with UDFs, since they could return a len > 
the slot size.



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Fri, 12 Feb 2021 22:08:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5675: Support UTF-8 Varchar and Char

Reply via email to