Hello Qifan Chen, Tim Armstrong, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/16909
to look at the new patch set (#19).
Change subject: IMPALA-5675: Support UTF-8 Varchar and Char types
......................................................................
IMPALA-5675: Support UTF-8 Varchar and Char types
This patch adds support for UTF-8 aware varchar and char types. In
UTF-8 mode, when truncating UTF-8 varchar(N) and char(N) strings,
lengths will be counted by characters instead of bytes. So the
result string will have up to N characters.
The UTF8_MODE query option is first detected in FE when analyzing the
query. A 'is_utf8' label is added in Exprs and SlotDescriptors. They are
used in generating thrift objects and computing the tuple layouts. A
char(N) slot will occupy 4 * N bytes if it's in UTF-8 type, because a
UTF-8 character can be encoded into 1~4 bytes. The slot will store up to
N characters.
There is a gotcha that we should not add the label in Type.java, because
Type instances are shared across the FE. Query compilation reuses the
Type instances from the metadata. If we modify Type instances during
compilation, other queries in non-UTF8 mode will be affected.
However, in BE, we need the type related classes (e.g. ColumnType,
TypeDesc) to carry in the utf8 markers. It's impractical to check the
UTF8_MODE query option everywhere it needs to be. E.g. in
AnyValUtil::SetAnyVal we can't access the query options. So we add the
'is_utf8' marker in TScalarType, ColumnType, TypeDesc to conveniently
recognize char(N) and varchar(N) types in UTF-8 mode. When generating
thrift objects in FE, Exprs and SlotDescriptors deliver 'is_utf8'
markers to TScalaTypes. They finally landed in ColumnType and TypeDesc
instances.
Given the correct UTF-8 mode checked, we just need to truncate/pad the
char/varchar strings with their length counted by characters. To make
sure we don't miss any places, this patch change ColmunType's 'len'
field to two separated fields, 'char_len' and 'byte_len', representing
the length in characters and length in bytes. Codes should explicitly
use the desired length.
Since char(N) slots always occupy 4N bytes, when converting char(N) to
other string types, we need to re-calculate the actual length in bytes
corresponding to N characters. We can optimize this in later patches,
e.g. store the actual length in the slot, or deal with UTF-8 char(N) in
the same way as varchar(N), i.e. reallocate the string space and just
store the pointer and length in the slot.
This patch won't deal with invalid UTF-8 characters. It will be the
focus of IMPALA-10761.
Tests:
- Add tests for reading char(N) and varchar(N) columns in UTF8_MODE.
- Add truncating/padding tests
- Kudu only supports Varchar currently. Add special tests for Kudu.
- Add tests for writing CHAR(N)/VARCHAR(N) in UTF-8 mode.
- Add test dimension on UTF8_MODE in test_chars.py
Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
---
M be/src/codegen/codegen-anyval.cc
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/llvm-codegen.cc
M be/src/exec/data-source-scan-node.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hdfs-avro-scanner-ir.cc
M be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-table-sink.cc
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-plain-test.cc
M be/src/exec/text-converter.cc
M be/src/exec/text-converter.inline.h
M be/src/exprs/agg-fn-evaluator.cc
M be/src/exprs/anyval-util.cc
M be/src/exprs/anyval-util.h
M be/src/exprs/cast-functions-ir.cc
M be/src/exprs/conditional-functions-ir.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/exprs/expr-codegen-test.cc
M be/src/exprs/literal.cc
M be/src/exprs/math-functions-ir.cc
M be/src/exprs/operators-ir.cc
M be/src/exprs/scalar-expr-evaluator.cc
M be/src/exprs/scalar-fn-call.cc
M be/src/exprs/slot-ref.cc
M be/src/exprs/string-functions-ir.cc
M be/src/exprs/utility-functions-ir.cc
M be/src/runtime/raw-value-ir.cc
M be/src/runtime/raw-value.cc
M be/src/runtime/raw-value.inline.h
M be/src/runtime/string-value.h
M be/src/runtime/tuple.cc
M be/src/runtime/types.cc
M be/src/runtime/types.h
M be/src/service/fe-support.cc
M be/src/service/hs2-util.cc
M be/src/testutil/test-udas.cc
M be/src/udf/udf-internal.h
M be/src/udf/udf.cc
M be/src/udf/udf.h
M be/src/util/dict-encoding.h
M be/src/util/in-list-filter-ir.cc
M be/src/util/in-list-filter.cc
M be/src/util/in-list-filter.h
M be/src/util/string-util-test.cc
M be/src/util/string-util.cc
M be/src/util/string-util.h
M be/src/util/tuple-row-compare.cc
M common/thrift/Types.thrift
M fe/src/main/java/org/apache/impala/analysis/Analyzer.java
M fe/src/main/java/org/apache/impala/analysis/CastExpr.java
M fe/src/main/java/org/apache/impala/analysis/Expr.java
M fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java
M fe/src/main/java/org/apache/impala/analysis/SlotRef.java
M fe/src/main/java/org/apache/impala/analysis/TupleDescriptor.java
M fe/src/main/java/org/apache/impala/catalog/ScalarType.java
M fe/src/main/java/org/apache/impala/catalog/Type.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/kudu_create.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-casting.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars-insert.test
A testdata/workloads/functional-query/queries/QueryTest/utf8-chars.test
M tests/query_test/test_chars.py
M tests/query_test/test_utf8_strings.py
79 files changed, 1,144 insertions(+), 285 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/09/16909/19
--
To view, visit http://gerrit.cloudera.org:8080/16909
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I62efa3042c64d1d005a2cf4fd1d31e992543963f
Gerrit-Change-Number: 16909
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>