Qifan Chen has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16909 )

Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................


Patch Set 14:

(26 comments)

Started the review maybe 1/3 way through. Will continue tomorrow.

Overall, it looks good to me.

One major concern is the use of utf8 mode query option. It may be OK for debug 
purpose. In practice, however it may be better to rely on table property (utf8 
or non utf8) to make the whole thing automatic.

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/anyval-util.h
File be/src/exprs/anyval-util.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/anyval-util.h@310
PS14, Line 310: sv->ptr = const_cast<uint8_t*>(reinterpret_cast<const 
uint8_t*>(slot));
              :         sv->len = (type.is_utf8 && type.type == TYPE_CHAR)?
              :             FindUtf8Pos(reinterpret_cast<const char*>(slot), 
type.len * 4, type.len)
              :             : type.len;
Maybe separate the two cases to make the code cleaner.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc
File be/src/exprs/slot-ref.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@249
PS14, Line 249: type_.type == TYPE_FIXED_UDA_INTERMEDIATE
I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no?


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/exprs/slot-ref.cc@393
PS14, Line 393: )
I thought TYPE_FIXED_UDA_INTERMEDIATE is utf8 irrelevant, no?


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/string-value.h
File be/src/runtime/string-value.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/string-value.h@45
PS14, Line 45:   /// The length of the string in bytes.
May append "regardless whether it is binary or UTF8".


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.h
File be/src/runtime/types.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.h@73
PS14, Line 73: /// It's the logical length of the type. E.g. CHAR(3) has len=3 
regardless of whether
             :   /// we are in UTF-8 mode
Elsewhere in the code, it seems 'len' is interpreted as length in bytes for 
VARCHAR. For example, GetMaxStrLen() is defined only for CHAR.

Suggest to distinguish between the length in characters and the length in 
bytes. For example, use 'len' for length in characters, and add a new data 
member 'len_in_bytes' to facilitate computation with char/varchar/string types.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.h@119
PS14, Line 119: int len
Add a comment on the unit of 'len'.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.h@129
PS14, Line 129: int len,
Add a comment on the unit of 'len'.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.h@297
PS14, Line 297: GetMaxStrLen
It will be good if we pay the cost of storing the length in bytes once 
(computing it once in the cstr) to avoid the computation many times.

The method could be named as GetStorageLen() or GetStorageSize().


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc
File be/src/runtime/types.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@52
PS14, Line 52: type == TYPE_FIXED_UDA_INTERMEDIATE
Consider make a special case for TYPE_FIXED_UDA_INTEMEDIATE which cares only 
the len.

For CHAR, VARCHAR and STRING, do both len and utf8.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/runtime/types.cc@333
PS14, Line 333: STRING UTF8
Probably should be STRING(UTF8).


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h
File be/src/udf/udf-internal.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf-internal.h@148
PS14, Line 148: ,
May add a comment "to return storage size".


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h
File be/src/udf/udf.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/udf/udf.h@106
PS14, Line 106: int
Maybe indicate it is the length in characters.

Ideally, we can make use of a LENGTH_TYPE (instead of int) here.

struct LENGTH_TYPE {
  int char_len;
  int storage_len;
}


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/util/bit-util.h
File be/src/util/bit-util.h:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/util/bit-util.h@133
PS14, Line 133: >
Should be ">=" as 0x00 is a single byte UTF8 character.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/util/bit-util.h@138
PS14, Line 138: }
May add a DCHECK(false) here.


http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/util/string-util.cc
File be/src/util/string-util.cc:

http://gerrit.cloudera.org:8080/#/c/16909/14/be/src/util/string-util.cc@96
PS14, Line 96:  if (utf8_cnt != nullptr) *utf8_cnt = 0;
             :   if (len == 0) return 0;
These two lines can be removed.


http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift
File common/thrift/Types.thrift:

http://gerrit.cloudera.org:8080/#/c/16909/14/common/thrift/Types.thrift@66
PS14, Line 66: E.g. CHAR(3) has len=3.
May add: len is the number of characters for CHAR or VARCHAR.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java
File fe/src/main/java/org/apache/impala/analysis/Analyzer.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2813
PS14, Line 2813: isUtf8Mode()
This probably can create confusions as the interpretation of the column 
character set (binary or UTF8) is by this query option. For most HDFS tables, 
maybe we can use table property such that the character set is a property of 
the table.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java
File fe/src/main/java/org/apache/impala/analysis/CastExpr.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@65
PS14, Line 65: isUtf8_ = e.isUtf8_;
Should we check the targetType here too? For example, for CAST(c as INT), 
isUTF8 does not apply to the type of CAST().


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/CastExpr.java@105
PS14, Line 105: 8_ = e.isU
Same comment as above.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
File fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java@296
PS14, Line 296: if (d.getType().isScalarType(PrimitiveType.CHAR) && 
d.isUtf8Mode()) slotSize *= 4;
It probably is a good idea to move this logic to ScalarType where CHAR data 
types are handled.


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java
File fe/src/main/java/org/apache/impala/catalog/ScalarType.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/ScalarType.java@42
PS14, Line 42: size, i.e. CHAR, VARCHAR
better reworded as "by number of characters, for CHAR and VCHAR".

Just wonder if we need a new field storage_len_ (the actual storage length in 
bytes).


http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/Type.java
File fe/src/main/java/org/apache/impala/catalog/Type.java:

http://gerrit.cloudera.org:8080/#/c/16909/14/fe/src/main/java/org/apache/impala/catalog/Type.java@278
PS14, Line 278: public static TColumnType addUtf8Markers(TColumnType container, 
boolean isUtf8) {
              :     for (TTypeNode t : container.types) {
              :       if (t.type == TTypeNodeType.SCALAR) {
              :         t.scalar_type.setIs_utf8(isUtf8);
              :       }
              :     }
              :     return container;
              :   }
Sounds like this should be a method on TColumnType.

Since isUTF8 can be false, sounds like the method should be named as 
propagateUTF8Markers().


http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
File testdata/workloads/functional-query/queries/QueryTest/kudu_create.test:

http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/kudu_create.test@484
PS14, Line 484: kudu;
May consider other table format such as PARQUET, ORC etc.


http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
File 
testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test:

http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test@4
PS14, Line 4: "李小龙"
May consider use single quote as double quote is for delimitated identifiers in 
ANSI SQL.


http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test@4
PS14, Line 4: "李小龙"
Consider adding literals with a combination of different utf8 lengths, such as 
cast("¢?€E" as T), where T is char(n), varhchar(n) and string. See 
table"Representation of UTF-8 characters" in 
https://en.wikipedia.org/wiki/UTF-8.


http://gerrit.cloudera.org:8080/#/c/16909/14/testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test@14
PS14, Line 14: "李小龙" as char(2)
Consider some other cast examples:  char-> non-char.

Example: cast('?' as int).



--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Comment-Date: Mon, 22 Feb 2021 17:19:04 +0000
Gerrit-HasComments: Yes

Reply via email to