Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16909 )
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char ...................................................................... Patch Set 13: (5 comments) I looked at the backend change and the approach seems sane. I think an alternative for CHAR would be to switch to storing it in as a variable-length slot, i.e. StringValue, but to retain the padding logic. The approach you choose probably requires fewer code changes, which is good for a legacy data type that is hopefully rarely used. http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@7 PS13, Line 7: IMPALA-5675: Support UTF-8 Varchar and Char One minor thing : I think we should comment the len field, e.g. in StringValue StringVal, ColumnType, TypeDescriptor to be clear whether it means the length in bytes or the length in characters. http://gerrit.cloudera.org:8080/#/c/16909/13//COMMIT_MSG@45 PS13, Line 45: In the backend, how did you identify all the places you needed to change the code? I can think of a number of places where the behaviour depends on the CHAR/VARCHAR length * Cast functions to CHAR/VARCHAR * Cast functions from STRING to CHAR, where we need to identify the "true" length of the CHAR. Although if we switched to the truncating behaviour suggested in IMPALA-1652, that might be the same for both code paths. * Scanners where we read and may need to pad or truncate according to the data type. * Expression evaluation where we turn a CHAR slot into a StringVal and a StringVal into a CHAR slot * Both the codegen'd and interpreted codepaths of the above. Generally I think my concern is that we've caught all the places and that we're testing the code. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc File be/src/exec/kudu-scanner.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exec/kudu-scanner.cc@392 PS13, Line 392: if (!state_->utf8_mode()) { Should we add a DCHECK in UTF-8 mode to verify that the strings returned from Kudu do not have > the required number of characters. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc File be/src/exprs/cast-functions-ir.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/cast-functions-ir.cc@218 PS13, Line 218: // TODO: Will timestamp strings have non-ASCII characters? I don't think they could? Custom timestamp formats could have them, but that doesn't apply here- you can't have a custom format in a cast. http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc File be/src/exprs/scalar-expr-evaluator.cc: http://gerrit.cloudera.org:8080/#/c/16909/13/be/src/exprs/scalar-expr-evaluator.cc@316 PS13, Line 316: // TODO: consider returning reference of the StringVal in UTF-8 mode. You might have to be careful here with UDFs, since they could return a len > the slot size. -- To view, visit http://gerrit.cloudera.org:8080/16909 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f Gerrit-Change-Number: 16909 Gerrit-PatchSet: 13 Gerrit-Owner: Quanlong Huang <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Fri, 12 Feb 2021 22:08:45 +0000 Gerrit-HasComments: Yes
