Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................


Patch Set 18:

(8 comments)

Resolved the remaining comments.

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/kudu-util.cc@150
PS14, Line 150: EAN);
> This is another example on utf8 mode. If utf8 is a property of the data whi
Ack


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/parquet/parquet-common.h@222
PS14, Line 222: 967
> For non BYTE_ARRAY type, it is better to supply 'false' directly here.
Done


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h
File be/src/exec/text-converter.inline.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exec/text-converter.inline.h@100
PS14, Line 100:  if (type.type == TYPE_CHAR) {
              :         // Truncates the string if in UTF-8 mode.
              :         if (type.is_utf8) str.len = FindUtf8PosForward(str.ptr, 
str.len, type.char_len);
              :         StringValue::PadWithSpaces(str.ptr, buffer_len, 
str.len);
              :         str.len = buffer_len;
              :       } else if (type.type == TYPE_VARCHAR && type.is_utf8) {
              :         // Truncates the string if in UTF-8 mode.
              :         str.len = FindUtf8PosForward(str.ptr, str.len, 
type.char_len);
              :       }
> Can we reuse CastFunctions::CastToChar(FunctionContext* ctx, const StringVa
CastFunctions::CastToChar() is the implementation of CAST(x as CHAR(N)). It's 
more complex than this case. Here we are materializing a CHAR/VARCHAR slot. For 
a CHAR slot, its size is always char_len * 4, i.e. 'buffer_len'. We can simply 
fill the remaining bytes with whitespaces. However, in 
CastFunctions::CastToChar(), we are returning a StringVal which has variable 
length. We need to calculate the actual characters in the CHAR(N) string, and 
then append the accurate amount of missing spaces. So it's hard to reuse 
CastFunctions::CastToChar() here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc
File be/src/exprs/agg-fn-evaluator.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/agg-fn-evaluator.cc@284
PS14, Line 284: intermediate type
> Good question! From the comment below, the immediate type can be created fo
I think we can consider that if we do have a new agg function needs it in the 
future.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h@106
PS14, Line 106: int
> Maybe indicate it is the length in characters.
Changed this to char_len since we don't need storage_len/byte_len in it.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2813
PS14, Line 2813: titute(nullS
> This probably can create confusions as the interpretation of the column cha
As discussed offline, column/table property on character set is not supported 
even in Hive. It also introduces the burdon of upgrading existing tables and 
modifying existing workloads/queries at the same time. Let's target on it in 
the future. Feel free to file a JIRA for it.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@65
PS14, Line 65: isUtf8_ = e.isUtf8_;
> Should we check the targetType here too? For example, for CAST(c as INT), i
If the target type is not a string type, the isUtf8 marker won't be used. So I 
think it's ok to leave it as-is.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@105
PS14, Line 105: 8_ = e.isU
> Same comment as above.
Same reply as above.



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Sat, 28 May 2022 10:34:45 +0000
Gerrit-HasComments: Yes

Reply via email to