This is an automated email from the ASF dual-hosted git repository.
HappenLee pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push:
new d5fb1e54964 [fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on
dict-encoded varchar columns (#63498)
d5fb1e54964 is described below
commit d5fb1e54964abe8da141880e9e0372ef61699997
Author: HappenLee <[email protected]>
AuthorDate: Fri May 22 17:49:53 2026 +0800
[fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on dict-encoded varchar
columns (#63498)
Problem Summary:
`SELECT length(col_varchar)` on a low-cardinality (dict-encoded) column
failed with:
NOT_IMPLEMENTED_ERROR: Method insert_offsets_from_lengths is not
supported
Root cause (two-step):
1. When `enable_low_cardinality_optimize=true`, the predicate column is
created as `ColumnDictI32` by `Schema::get_predicate_column_ptr()`. The
`only_read_offsets` code path in `BinaryDictPageDecoder` resolves dict
codes to lengths and calls `dst->insert_offsets_from_lengths()`, but
`ColumnDictI32` does not implement that method.
2. After converting `ColumnDictI32` to
`PredicateColumnType<TYPE_STRING>` via
`convert_to_predicate_column_if_dictionary()`, the same call failed
again because `PredicateColumnType` also lacked the implementation.
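The failure mode can be sketched in isolation. This is a minimal illustration, not the Doris code: `ColumnBase`, `DictCodeColumn`, `LengthOnlyColumn`, `resolve_lengths` and `try_insert` are hypothetical names, and the sketch assumes the base class supplies a default implementation that rejects the call, as the error text suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the column base class. The default
// implementation rejects the call (assumption based on the error text).
struct ColumnBase {
    virtual ~ColumnBase() = default;
    virtual void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) {
        (void)lengths;
        (void)num;
        throw std::runtime_error("Method insert_offsets_from_lengths is not supported");
    }
};

// Stand-in for a dict-code column: it stores codes and never overrides
// the method, so the base-class default is what runs.
struct DictCodeColumn : ColumnBase {
    std::vector<int32_t> codes;
};

// Stand-in for a column type that does provide the override.
struct LengthOnlyColumn : ColumnBase {
    std::vector<uint32_t> sizes;
    void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) override {
        sizes.insert(sizes.end(), lengths, lengths + num);
    }
};

// Resolve dict codes to string lengths without copying any characters.
std::vector<uint32_t> resolve_lengths(const std::vector<std::string>& dict,
                                      const std::vector<int32_t>& codes) {
    std::vector<uint32_t> lengths;
    lengths.reserve(codes.size());
    for (int32_t code : codes) {
        lengths.push_back(static_cast<uint32_t>(dict.at(static_cast<size_t>(code)).size()));
    }
    return lengths;
}

// Returns false when the destination column rejects the call.
bool try_insert(ColumnBase& dst, const std::vector<uint32_t>& lengths) {
    try {
        dst.insert_offsets_from_lengths(lengths.data(), lengths.size());
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

In this sketch the lengths are resolved correctly in both cases; only the destination column type decides whether the insert succeeds, which mirrors the two-step root cause described above.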
Fix:
- `binary_dict_page.cpp`: call
`convert_to_predicate_column_if_dictionary()` on `dst` before invoking
`insert_offsets_from_lengths` in both `next_batch()` and
`read_by_rowids()` (covers dict-encoded pages and the plain-encoded
fallback path).
- `predicate_column.h`: implement `insert_offsets_from_lengths` for the
`StringRef` specialisation of `PredicateColumnType`. A single backing
buffer is allocated from the internal Arena and zero-filled; each
element records the correct length so that downstream
`filter_by_selector` / `copy_column_data_by_selector` can materialise
the right-sized strings into the output `ColumnString`, giving the
correct result for `length()`.
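The length-only insertion can be illustrated with a standalone sketch. It is a simplified model under stated assumptions, not the actual `PredicateColumnType` code: `StringRefSketch` and `LengthOnlyStrings` are hypothetical names, and a `std::unique_ptr<char[]>` per call stands in for the internal Arena.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for StringRef: a pointer plus a size.
struct StringRefSketch {
    const char* data = nullptr;
    size_t size = 0;
};

class LengthOnlyStrings {
public:
    // Append `num` entries that carry correct sizes but no real characters.
    void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) {
        if (num == 0) {
            return;
        }
        size_t total_bytes = 0;
        for (size_t i = 0; i < num; ++i) {
            total_bytes += lengths[i];
        }
        char* buf = nullptr;
        if (total_bytes > 0) {
            // make_unique<char[]> value-initialises the array, so it is
            // zero-filled; the vector keeps the buffer alive (Arena stand-in).
            _buffers.push_back(std::make_unique<char[]>(total_bytes));
            buf = _buffers.back().get();
        }
        const size_t old_size = _refs.size();
        _refs.resize(old_size + num);
        size_t offset = 0;
        for (size_t i = 0; i < num; ++i) {
            // Zero-length entries keep a null pointer; readers must check size first.
            _refs[old_size + i].data = (lengths[i] > 0) ? (buf + offset) : nullptr;
            _refs[old_size + i].size = lengths[i];
            offset += lengths[i];
        }
    }

    const std::vector<StringRefSketch>& refs() const { return _refs; }

private:
    std::vector<std::unique_ptr<char[]>> _buffers;
    std::vector<StringRefSketch> _refs;
};
```

One allocation per batch keeps the cost proportional to the total length rather than the number of rows, and every non-empty entry points at readable memory, so a later copy of `size` bytes is safe even though the bytes are all zero.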
### Release note
Fix crash/error when calling `length()` on a varchar/char column that
uses the low-cardinality (dictionary) optimisation.
### Check List (For Author)
- Test: Regression test (`test_length_dict_encoded`) and manual test (query reproduced and confirmed fixed)
- Behavior changed: No
- Does this need documentation: No
---------
Co-authored-by: Copilot <[email protected]>
---
be/src/core/column/predicate_column.h | 35 +++++++
be/src/storage/segment/binary_dict_page.cpp | 7 ++
.../string_functions/test_length_dict_encoded.out | 55 +++++++++++
.../test_length_dict_encoded.groovy | 107 +++++++++++++++++++++
4 files changed, 204 insertions(+)
diff --git a/be/src/core/column/predicate_column.h b/be/src/core/column/predicate_column.h
index d98743500db..2dbea5da684 100644
--- a/be/src/core/column/predicate_column.h
+++ b/be/src/core/column/predicate_column.h
@@ -293,6 +293,41 @@ public:
}
}
+ // Insert `num` entries with only length information (no actual char data).
+ // The chars buffer is zero-filled so that filter_by_selector can safely
+ // memcpy without reading meaningful content. Used in OFFSET_ONLY reading
+ // mode where only string lengths (for length() function) are needed.
+ void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) override {
+ if constexpr (std::is_same_v<T, StringRef>) {
+ if (UNLIKELY(num == 0)) {
+ return;
+ }
+ size_t total_bytes = 0;
+ for (size_t i = 0; i < num; ++i) {
+ total_bytes += lengths[i];
+ }
+ // Allocate and zero-fill a single backing buffer so that each StringRef
+ // points to valid (though meaningless) memory. filter_by_selector will
+ // memcpy from these pointers, so they must not be null for non-zero lengths.
+ char* buf = total_bytes > 0 ? _arena.alloc(total_bytes) : nullptr;
+ if (total_bytes > 0) {
+ memset(buf, 0, total_bytes);
+ }
+ size_t org_elem_num = data.size();
+ data.resize(org_elem_num + num);
+ size_t offset = 0;
+ for (size_t i = 0; i < num; ++i) {
+ // For zero-length strings, data pointer is null; insert_many_strings
+ // and filter_by_selector both guard on size > 0 before dereferencing.
+ data[org_elem_num + i].data = (lengths[i] > 0) ? (buf + offset) : nullptr;
+ data[org_elem_num + i].size = lengths[i];
+ offset += lengths[i];
+ }
+ } else {
+ IColumn::insert_offsets_from_lengths(lengths, num);
+ }
+ }
+
void insert_default() override { data.push_back(T()); }
void clear() override {
diff --git a/be/src/storage/segment/binary_dict_page.cpp b/be/src/storage/segment/binary_dict_page.cpp
index fdbf8914ad4..2a4438550ea 100644
--- a/be/src/storage/segment/binary_dict_page.cpp
+++ b/be/src/storage/segment/binary_dict_page.cpp
@@ -290,6 +290,10 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, MutableColumnPtr& dst) {
if (_options.only_read_offsets) {
// OFFSET_ONLY mode: resolve dict codes to get real string lengths
// without copying actual char data. This allows length() to work.
+ // ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
+ // it to a predicate column (PredicateColumnType<TYPE_STRING>) first. This is a no-op for
+ // non-dictionary columns and for ColumnNullable it converts the nested column.
+ dst = dst->convert_to_predicate_column_if_dictionary();
const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
size_t start_index = _bit_shuffle_ptr->_cur_index;
// Reuse _buffer (int32_t vector) to store uint32_t lengths.
@@ -334,6 +338,9 @@ Status BinaryDictPageDecoder::read_by_rowids(const rowid_t* rowids, ordinal_t pa
if (_options.only_read_offsets) {
// OFFSET_ONLY mode: resolve dict codes to get real string lengths
// without copying actual char data. This allows length() to work correctly.
+ // ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
+ // it to a predicate column (PredicateColumnType<TYPE_STRING>) first.
+ dst = dst->convert_to_predicate_column_if_dictionary();
const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
size_t read_count = 0;
_buffer.resize(total);
diff --git a/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out b/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out
new file mode 100644
index 00000000000..684a50de7b9
--- /dev/null
+++ b/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out
@@ -0,0 +1,55 @@
+-- This file is automatically generated. You should know what you did if you want to edit this
+-- !length_nullable_not_null --
+1
+1
+1
+1
+1
+2
+
+-- !length_not_null_col --
+1
+1
+1
+1
+1
+1
+1
+1
+
+-- !length_with_nulls --
+\N 50
+\N 51
+1 24
+1 28
+1 30
+1 41
+1 5
+2 60
+
+-- !char_length_nullable_not_null --
+1
+1
+1
+1
+1
+2
+
+-- !length_nullable_is_null --
+\N
+\N
+
+-- !char_length_all_rows --
+\N 50
+\N 51
+1 24
+1 28
+1 30
+1 41
+1 5
+2 60
+
+-- !char_length_nullable_is_null --
+\N
+\N
+
diff --git a/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy b/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy
new file mode 100644
index 00000000000..6f64c5b9218
--- /dev/null
+++ b/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Regression test for: length() / char_length() on dict-encoded (low-cardinality)
+// varchar columns must not throw NOT_IMPLEMENTED_ERROR when the "only_read_offsets"
+// optimisation is active (i.e. when the storage layer resolves dict codes to string
+// lengths without materialising actual character data).
+suite("test_length_dict_encoded") {
+ sql "DROP TABLE IF EXISTS test_length_dict_varchar"
+ sql """
+ CREATE TABLE test_length_dict_varchar (
+ col_int_undef_signed int NULL,
+ col_int_undef_signed_not_null int NOT NULL,
+ col_date_undef_signed date NULL,
+ col_date_undef_signed_not_null date NOT NULL,
+ col_varchar_5__undef_signed varchar(5) NULL,
+ col_varchar_5__undef_signed_not_null varchar(5) NOT NULL,
+ pk int NULL
+ ) ENGINE=OLAP
+ DUPLICATE KEY(col_int_undef_signed)
+ PARTITION BY RANGE(col_int_undef_signed) (
+ PARTITION p0 VALUES [('-2147483648'), ('4')),
+ PARTITION p1 VALUES [('4'), ('6')),
+ PARTITION p2 VALUES [('6'), ('7')),
+ PARTITION p3 VALUES [('7'), ('8')),
+ PARTITION p4 VALUES [('8'), ('10')),
+ PARTITION p5 VALUES [('10'), ('83647')),
+ PARTITION p100 VALUES [('83647'), ('2147483647'))
+ )
+ DISTRIBUTED BY HASH(pk) BUCKETS 10
+ PROPERTIES ('replication_allocation' = 'tag.location.default: 1')
+ """
+
+ sql """
+ INSERT INTO test_length_dict_varchar VALUES
+ (6, 5, '2023-12-13', '2023-12-11', 'o', 'i', 30),
+ (6, 6, NULL, '2023-12-18', 'w', 'l', 24),
+ (8, -8278102, '2023-12-13', '2023-12-11', 'x', 'c', 28),
+ (15971, 8, NULL, '2015-06-11', 'h', 'r', 41),
+ (6, 5, '2023-12-11', '2023-12-17', 'd', 'q', 5),
+ (7, 100, '2023-12-14', '2023-12-15', NULL, 'a', 50),
+ (7, 101, '2023-12-15', '2023-12-16', NULL, 'b', 51),
+ (9, 200, NULL, '2023-12-17', 'ab', 'd', 60)
+ """
+
+ // length() on nullable dict-encoded varchar filtered by IS NOT NULL predicate
+ order_qt_length_nullable_not_null """
+ SELECT length(col_varchar_5__undef_signed)
+ FROM test_length_dict_varchar
+ WHERE col_varchar_5__undef_signed IS NOT NULL
+ """
+
+ // length() on NOT NULL dict-encoded varchar (no predicate needed)
+ order_qt_length_not_null_col """
+ SELECT length(col_varchar_5__undef_signed_not_null)
+ FROM test_length_dict_varchar
+ """
+
+ // length() returns NULL for NULL varchar values
+ order_qt_length_with_nulls """
+ SELECT length(col_varchar_5__undef_signed), pk
+ FROM test_length_dict_varchar
+ ORDER BY pk
+ """
+
+ // char_length() is a synonym and must work as well
+ order_qt_char_length_nullable_not_null """
+ SELECT char_length(col_varchar_5__undef_signed)
+ FROM test_length_dict_varchar
+ WHERE col_varchar_5__undef_signed IS NOT NULL
+ """
+
+ // length() returns NULL for NULL values — verify NULL passthrough
+ order_qt_length_nullable_is_null """
+ SELECT length(col_varchar_5__undef_signed)
+ FROM test_length_dict_varchar
+ WHERE col_varchar_5__undef_signed IS NULL
+ """
+
+ // char_length() on all rows including NULLs; verify dict code path with mixed null/non-null
+ order_qt_char_length_all_rows """
+ SELECT char_length(col_varchar_5__undef_signed), pk
+ FROM test_length_dict_varchar
+ ORDER BY pk
+ """
+
+ // char_length() on NULL-only rows
+ order_qt_char_length_nullable_is_null """
+ SELECT char_length(col_varchar_5__undef_signed)
+ FROM test_length_dict_varchar
+ WHERE col_varchar_5__undef_signed IS NULL
+ """
+}