This is an automated email from the ASF dual-hosted git repository.

HappenLee pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/master by this push:
     new d5fb1e54964 [fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on dict-encoded varchar columns (#63498)
d5fb1e54964 is described below

commit d5fb1e54964abe8da141880e9e0372ef61699997
Author: HappenLee <[email protected]>
AuthorDate: Fri May 22 17:49:53 2026 +0800

    [fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on dict-encoded varchar columns (#63498)
    
    Problem Summary:
    `SELECT length(col_varchar)` on a low-cardinality (dict-encoded) column
    failed with:
    NOT_IMPLEMENTED_ERROR: Method insert_offsets_from_lengths is not
    supported
    
    Root cause (two-step):
    1. When `enable_low_cardinality_optimize=true`, the predicate column is
    created as `ColumnDictI32` by `Schema::get_predicate_column_ptr()`. The
    `only_read_offsets` code path in `BinaryDictPageDecoder` resolves dict
    codes to lengths and calls `dst->insert_offsets_from_lengths()`, but
    `ColumnDictI32` does not implement that method.
    2. After converting `ColumnDictI32` to
    `PredicateColumnType<TYPE_STRING>` via
    `convert_to_predicate_column_if_dictionary()`, the same call failed
    again because `PredicateColumnType` also lacked the implementation.
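    
    A minimal sketch of the failure mode above, for illustration only. It is
    not Doris code: `Column`, `DictColumn`, `LengthColumn` and the free
    function `convert_if_dictionary` are hypothetical stand-ins, and the
    real conversion presumably also carries over data already held by the
    column.
    
    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <stdexcept>
    #include <utility>
    #include <vector>
    
    // Hypothetical base class: the default implementation rejects the call,
    // which is what surfaces as "not supported" for column types that do
    // not override it.
    struct Column {
        virtual ~Column() = default;
        virtual void insert_offsets_from_lengths(const uint32_t*, size_t) {
            throw std::logic_error("Method insert_offsets_from_lengths is not supported");
        }
    };
    
    // Stand-in for the dictionary column: no override, so the base throws.
    struct DictColumn : Column {};
    
    // Stand-in for a column type that does implement the method.
    struct LengthColumn : Column {
        std::vector<uint32_t> lengths;
        void insert_offsets_from_lengths(const uint32_t* src, size_t num) override {
            lengths.insert(lengths.end(), src, src + num);
        }
    };
    
    // Assumed shape of the conversion step: a no-op for non-dictionary
    // columns, a replacement column for dictionary ones.
    std::unique_ptr<Column> convert_if_dictionary(std::unique_ptr<Column> col) {
        if (dynamic_cast<DictColumn*>(col.get()) != nullptr) {
            return std::make_unique<LengthColumn>();
        }
        return col;
    }
    ```
    
    Calling the method on a `DictColumn` throws; calling it after
    `dst = convert_if_dictionary(std::move(dst))` succeeds only because the
    replacement type overrides the method, which is why both steps were
    needed.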
    
    Fix:
    - `binary_dict_page.cpp`: call
    `convert_to_predicate_column_if_dictionary()` on `dst` before invoking
    `insert_offsets_from_lengths` in both `next_batch()` and
    `read_by_rowids()` (covers dict-encoded pages and the plain-encoded
    fallback path).
    - `predicate_column.h`: implement `insert_offsets_from_lengths` for the
    `StringRef` specialisation of `PredicateColumnType`. A single backing
    buffer is allocated from the internal Arena and zero-filled; each
    element records the correct length so that downstream
    `filter_by_selector` / `copy_column_data_by_selector` can materialise
    the right-sized strings into the output `ColumnString`, giving the
    correct result for `length()`.
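    
    A minimal sketch of the length-only layout described in the
    `predicate_column.h` item, for illustration only. `Ref` and
    `LengthOnlyColumn` are hypothetical stand-ins for `StringRef` and
    `PredicateColumnType`, and heap-allocated chunks stand in for the Arena.
    
    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <numeric>
    #include <vector>
    
    // Hypothetical stand-in for StringRef: non-owning pointer plus size.
    struct Ref {
        const char* data;
        size_t size;
    };
    
    class LengthOnlyColumn {
    public:
        // Append `num` entries that carry correct sizes but no real bytes.
        void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) {
            if (num == 0) {
                return;
            }
            const size_t total = std::accumulate(lengths, lengths + num, size_t{0});
            const char* buf = nullptr;
            if (total > 0) {
                // make_unique<char[]> value-initialises, so the buffer is zeroed.
                _chunks.push_back(std::make_unique<char[]>(total));
                buf = _chunks.back().get();
            }
            size_t offset = 0;
            for (size_t i = 0; i < num; ++i) {
                // Zero-length entries keep a null pointer; others point into buf.
                _refs.push_back({lengths[i] > 0 ? buf + offset : nullptr, lengths[i]});
                offset += lengths[i];
            }
        }
    
        const std::vector<Ref>& refs() const { return _refs; }
    
    private:
        std::vector<std::unique_ptr<char[]>> _chunks; // keeps buffers alive
        std::vector<Ref> _refs;
    };
    ```
    
    With lengths {1, 0, 2} the three entries share one 3-byte zeroed buffer:
    the first points at offset 0, the second is null with size 0, and the
    third points at offset 1. A later copy of `size` bytes per entry reads
    only valid memory.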
    
    ### Release note
    
    Fix crash/error when calling `length()` on a varchar/char column that
    uses the low-cardinality (dictionary) optimisation.
    
    ### Check List (For Author)
    
    - Test: Manual test (query reproduced and confirmed fixed)
    - Behavior changed: No
    - Does this need documentation: No
    
    ---------
    
    Co-authored-by: Copilot <[email protected]>
---
 be/src/core/column/predicate_column.h              |  35 +++++++
 be/src/storage/segment/binary_dict_page.cpp        |   7 ++
 .../string_functions/test_length_dict_encoded.out  |  55 +++++++++++
 .../test_length_dict_encoded.groovy                | 107 +++++++++++++++++++++
 4 files changed, 204 insertions(+)

diff --git a/be/src/core/column/predicate_column.h b/be/src/core/column/predicate_column.h
index d98743500db..2dbea5da684 100644
--- a/be/src/core/column/predicate_column.h
+++ b/be/src/core/column/predicate_column.h
@@ -293,6 +293,41 @@ public:
         }
     }
 
+    // Insert `num` entries with only length information (no actual char data).
+    // The chars buffer is zero-filled so that filter_by_selector can safely
+    // memcpy without reading meaningful content. Used in OFFSET_ONLY reading
+    // mode where only string lengths (for length() function) are needed.
+    void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) override {
+        if constexpr (std::is_same_v<T, StringRef>) {
+            if (UNLIKELY(num == 0)) {
+                return;
+            }
+            size_t total_bytes = 0;
+            for (size_t i = 0; i < num; ++i) {
+                total_bytes += lengths[i];
+            }
+            // Allocate and zero-fill a single backing buffer so that each StringRef
+            // points to valid (though meaningless) memory. filter_by_selector will
+            // memcpy from these pointers, so they must not be null for non-zero lengths.
+            char* buf = total_bytes > 0 ? _arena.alloc(total_bytes) : nullptr;
+            if (total_bytes > 0) {
+                memset(buf, 0, total_bytes);
+            }
+            size_t org_elem_num = data.size();
+            data.resize(org_elem_num + num);
+            size_t offset = 0;
+            for (size_t i = 0; i < num; ++i) {
+                // For zero-length strings, data pointer is null; insert_many_strings
+                // and filter_by_selector both guard on size > 0 before dereferencing.
+                data[org_elem_num + i].data = (lengths[i] > 0) ? (buf + offset) : nullptr;
+                data[org_elem_num + i].size = lengths[i];
+                offset += lengths[i];
+            }
+        } else {
+            IColumn::insert_offsets_from_lengths(lengths, num);
+        }
+    }
+
     void insert_default() override { data.push_back(T()); }
 
     void clear() override {
diff --git a/be/src/storage/segment/binary_dict_page.cpp b/be/src/storage/segment/binary_dict_page.cpp
index fdbf8914ad4..2a4438550ea 100644
--- a/be/src/storage/segment/binary_dict_page.cpp
+++ b/be/src/storage/segment/binary_dict_page.cpp
@@ -290,6 +290,10 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, MutableColumnPtr& dst) {
     if (_options.only_read_offsets) {
         // OFFSET_ONLY mode: resolve dict codes to get real string lengths
         // without copying actual char data. This allows length() to work.
+        // ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
+        // it to a predicate column (ColumnString) first. This is a no-op for
+        // non-dictionary columns and for ColumnNullable it converts the nested column.
+        dst = dst->convert_to_predicate_column_if_dictionary();
         const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
         size_t start_index = _bit_shuffle_ptr->_cur_index;
         // Reuse _buffer (int32_t vector) to store uint32_t lengths.
@@ -334,6 +338,9 @@ Status BinaryDictPageDecoder::read_by_rowids(const rowid_t* rowids, ordinal_t pa
     if (_options.only_read_offsets) {
         // OFFSET_ONLY mode: resolve dict codes to get real string lengths
         // without copying actual char data. This allows length() to work correctly.
+        // ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
+        // it to a predicate column (ColumnString) first.
+        dst = dst->convert_to_predicate_column_if_dictionary();
         const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
         size_t read_count = 0;
         _buffer.resize(total);
diff --git a/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out b/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out
new file mode 100644
index 00000000000..684a50de7b9
--- /dev/null
+++ b/regression-test/data/query_p0/sql_functions/string_functions/test_length_dict_encoded.out
@@ -0,0 +1,55 @@
+-- This file is automatically generated. You should know what you did if you want to edit this
+-- !length_nullable_not_null --
+1
+1
+1
+1
+1
+2
+
+-- !length_not_null_col --
+1
+1
+1
+1
+1
+1
+1
+1
+
+-- !length_with_nulls --
+\N     50
+\N     51
+1      24
+1      28
+1      30
+1      41
+1      5
+2      60
+
+-- !char_length_nullable_not_null --
+1
+1
+1
+1
+1
+2
+
+-- !length_nullable_is_null --
+\N
+\N
+
+-- !char_length_all_rows --
+\N     50
+\N     51
+1      24
+1      28
+1      30
+1      41
+1      5
+2      60
+
+-- !char_length_nullable_is_null --
+\N
+\N
+
diff --git a/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy b/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy
new file mode 100644
index 00000000000..6f64c5b9218
--- /dev/null
+++ b/regression-test/suites/query_p0/sql_functions/string_functions/test_length_dict_encoded.groovy
@@ -0,0 +1,107 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Regression test for: length() / char_length() on dict-encoded (low-cardinality)
+// varchar columns must not throw NOT_IMPLEMENTED_ERROR when the "only_read_offsets"
+// optimisation is active (i.e. when the storage layer resolves dict codes to string
+// lengths without materialising actual character data).
+suite("test_length_dict_encoded") {
+    sql "DROP TABLE IF EXISTS test_length_dict_varchar"
+    sql """
+        CREATE TABLE test_length_dict_varchar (
+          col_int_undef_signed           int          NULL,
+          col_int_undef_signed_not_null  int          NOT NULL,
+          col_date_undef_signed          date         NULL,
+          col_date_undef_signed_not_null date         NOT NULL,
+          col_varchar_5__undef_signed    varchar(5)   NULL,
+          col_varchar_5__undef_signed_not_null varchar(5) NOT NULL,
+          pk                             int          NULL
+        ) ENGINE=OLAP
+        DUPLICATE KEY(col_int_undef_signed)
+        PARTITION BY RANGE(col_int_undef_signed) (
+          PARTITION p0   VALUES [('-2147483648'), ('4')),
+          PARTITION p1   VALUES [('4'),  ('6')),
+          PARTITION p2   VALUES [('6'),  ('7')),
+          PARTITION p3   VALUES [('7'),  ('8')),
+          PARTITION p4   VALUES [('8'),  ('10')),
+          PARTITION p5   VALUES [('10'), ('83647')),
+          PARTITION p100 VALUES [('83647'), ('2147483647'))
+        )
+        DISTRIBUTED BY HASH(pk) BUCKETS 10
+        PROPERTIES ('replication_allocation' = 'tag.location.default: 1')
+    """
+
+    sql """
+        INSERT INTO test_length_dict_varchar VALUES
+          (6,     5,        '2023-12-13', '2023-12-11', 'o',  'i', 30),
+          (6,     6,        NULL,         '2023-12-18', 'w',  'l', 24),
+          (8,     -8278102, '2023-12-13', '2023-12-11', 'x',  'c', 28),
+          (15971, 8,        NULL,         '2015-06-11', 'h',  'r', 41),
+          (6,     5,        '2023-12-11', '2023-12-17', 'd',  'q',  5),
+          (7,     100,      '2023-12-14', '2023-12-15', NULL, 'a', 50),
+          (7,     101,      '2023-12-15', '2023-12-16', NULL, 'b', 51),
+          (9,     200,      NULL,         '2023-12-17', 'ab', 'd', 60)
+    """
+
+    // length() on nullable dict-encoded varchar filtered by IS NOT NULL predicate
+    order_qt_length_nullable_not_null """
+        SELECT length(col_varchar_5__undef_signed)
+        FROM   test_length_dict_varchar
+        WHERE  col_varchar_5__undef_signed IS NOT NULL
+    """
+
+    // length() on NOT NULL dict-encoded varchar (no predicate needed)
+    order_qt_length_not_null_col """
+        SELECT length(col_varchar_5__undef_signed_not_null)
+        FROM   test_length_dict_varchar
+    """
+
+    // length() returns NULL for NULL varchar values
+    order_qt_length_with_nulls """
+        SELECT length(col_varchar_5__undef_signed), pk
+        FROM   test_length_dict_varchar
+        ORDER BY pk
+    """
+
+    // char_length() is a synonym and must work as well
+    order_qt_char_length_nullable_not_null """
+        SELECT char_length(col_varchar_5__undef_signed)
+        FROM   test_length_dict_varchar
+        WHERE  col_varchar_5__undef_signed IS NOT NULL
+    """
+
+    // length() returns NULL for NULL values — verify NULL passthrough
+    order_qt_length_nullable_is_null """
+        SELECT length(col_varchar_5__undef_signed)
+        FROM   test_length_dict_varchar
+        WHERE  col_varchar_5__undef_signed IS NULL
+    """
+
+    // char_length() on all rows including NULLs — verify dict code path with mixed null/non-null
+    order_qt_char_length_all_rows """
+        SELECT char_length(col_varchar_5__undef_signed), pk
+        FROM   test_length_dict_varchar
+        ORDER  BY pk
+    """
+
+    // char_length() on NULL-only rows
+    order_qt_char_length_nullable_is_null """
+        SELECT char_length(col_varchar_5__undef_signed)
+        FROM   test_length_dict_varchar
+        WHERE  col_varchar_5__undef_signed IS NULL
+    """
+}

