Re: [PR] [refactor](be) remove CHAR padding on read [doris]

via GitHub Thu, 28 May 2026 23:55:38 -0700


github-actions[bot] commented on code in PR #63291:
URL: https://github.com/apache/doris/pull/63291#discussion_r3322648762



##########
be/src/core/data_type_serde/data_type_string_serde.cpp:
##########
@@ -462,25 +462,16 @@ Status 
DataTypeStringSerDeBase<ColumnType>::from_string(StringRef& str, IColumn&
 
 // Deserializes a STRING/VARCHAR/CHAR value from its OLAP string representation
 // (e.g. from ZoneMap protobuf). This is the inverse of to_olap_string().
-//
-// For CHAR type: if the string is shorter than the declared column length 
(_len),
-// pads with '\0' bytes to reach _len. This preserves CHAR's fixed-length 
semantics.
-// For STRING/VARCHAR: stores the string as-is.
-//
-// Examples:
-//   CHAR(10), str="hello"  => field = "hello\0\0\0\0\0" (10 bytes)
-//   VARCHAR,  str="hello"  => field = "hello" (5 bytes)
 template <typename ColumnType>
 Status DataTypeStringSerDeBase<ColumnType>::from_olap_string(const 
std::string& str, Field& field,
                                                              const 
FormatOptions& options) const {
-    if (cast_set<int>(str.size()) < _len) {
-        DCHECK_EQ(_type, TYPE_CHAR);
-        std::string tmp(_len, '\0');
-        memcpy(tmp.data(), str.data(), str.size());
-        field = Field::create_field<TYPE_CHAR>(std::move(tmp));
-    } else {
-        field = Field::create_field<TYPE_STRING>(str);
-    }
+    // CHAR(N) writes through OlapColumnDataConvertorChar are zero-padded to
+    // the declared schema length, so the serialized OLAP string carries
+    // trailing '\0' bytes. strnlen() drops that padding to surface the
+    // logical character content in the Field. VARCHAR / STRING never write
+    // trailing '\0' through this path, so strnlen is a no-op for them.
+    size_t len = strnlen(str.data(), str.size());
+    field = Field::create_field<TYPE_STRING>(std::string(str.data(), len));

Review Comment:
   This now truncates valid STRING/VARCHAR zone-map values that contain an 
embedded NUL byte. `to_olap_string()` returns the raw string bytes for all 
string types, so a VARCHAR value like `a\0z` can be persisted as the zone-map 
min/max. On read, `from_olap_string()` turns that into just `a`, so predicate 
zone-map evaluation can incorrectly prune a page containing `a\0z`. The 
trimming needs to be limited to CHAR padding semantics rather than using 
`strnlen()` unconditionally for every string type.



##########
be/src/storage/segment/binary_plain_page_v2_pre_decoder.h:
##########
@@ -37,127 +97,101 @@ namespace segment_v2 {
  * V1 format (output):
  *   Data: |binary1|binary2|...
  *   Trailer: |offset1(32-bit)|offset2(32-bit)|...| num_elems (32-bit)
+ *
+ * The decode pipeline is 7 steps:
+ *   1. parse header (validate sizes + extract num_elems + iteration bounds)
+ *   2. scan entries: record (data_start, out_len) per entry, sum total out_len
+ *   3. allocate the V1 output page (size_of_prefix + binary + offsets + 
trailer + tail)
+ *   4. write binary payload
+ *   5. write offsets array (running cursor over out_len)
+ *   6. write num_elems trailer
+ *   7. copy tail (footer + null map) and publish output params
+ *
+ * IS_CHAR=true picks the strnlen transform in step 2, so CHAR pages emit
+ * unpadded slices to the cached page. IS_CHAR=false keeps raw V2 lengths.
+ * The branch is `if constexpr` — compile-time dispatched, no overhead.
  */
+template <bool IS_CHAR>
 struct BinaryPlainPageV2PreDecoder : public DataPagePreDecoder {
-    /**
-     * @brief Decode BinaryPlainPageV2 data to BinaryPlainPage format
-     *
-     * @param page unique_ptr to hold page data, will be replaced by decoded 
data
-     * @param page_slice data to decode, will be updated to point to decoded 
data
-     * @param size_of_tail including size of footer and null map
-     * @param _use_cache whether to use page cache
-     * @param page_type the type of page
-     * @param file_path file path for error reporting
-     * @param size_of_prefix size of prefix space to reserve (for dict page 
header)
-     * @return Status
-     */
     Status decode(std::unique_ptr<DataPage>* page, Slice* page_slice, size_t 
size_of_tail,
                   bool _use_cache, segment_v2::PageTypePB page_type, const 
std::string& file_path,
                   size_t size_of_prefix = 0) override {
-        // Validate input
-        if (page_slice->size < sizeof(uint32_t) + size_of_tail) {
-            return Status::Corruption("Invalid page size: {}, expected at 
least {} in file: {}",
-                                      page_slice->size, sizeof(uint32_t) + 
size_of_tail, file_path);
+        // Step 1.
+        Slice data;
+        uint32_t num_elems = 0;
+        const uint8_t* ptr = nullptr;
+        const uint8_t* limit = nullptr;
+        RETURN_IF_ERROR(parse_header(*page_slice, size_of_tail, file_path, 
&data, &num_elems, &ptr,
+                                     &limit));
+
+        // Step 2: out_len derived per IS_CHAR.
+        std::vector<BinaryPlainV1Entry> entries;
+        entries.reserve(num_elems);
+        uint32_t total_out_len = 0;
+        for (uint32_t i = 0; i < num_elems; ++i) {
+            uint32_t raw_len = 0;
+            const uint8_t* data_start = nullptr;
+            RETURN_IF_ERROR(decode_one(ptr, limit, file_path, i, &data_start, 
&raw_len));
+            uint32_t out_len;
+            if constexpr (IS_CHAR) {
+                out_len = static_cast<uint32_t>(
+                        strnlen(reinterpret_cast<const char*>(data_start), 
raw_len));
+            } else {

Review Comment:
   Using `strnlen()` here removes bytes after the first NUL, not only the 
trailing zero padding. For a CHAR value whose logical bytes are `a\0b` in 
CHAR(4), the stored slice is `a\0b\0`; this pre-decoder rewrites it as `a` 
instead of `a\0b`. The same issue exists in the V1 CHAR pre-decoder. Please 
trim only the suffix padding zeros, not the first embedded NUL.



##########
be/src/storage/predicate/predicate_creator_comparison.cpp:
##########
@@ -77,21 +77,9 @@ std::shared_ptr<ColumnPredicate> 
create_comparison_predicate(const uint32_t cid,
                                                                            
opposite);
     }
     case TYPE_CHAR: {
-        auto target = std::max(cast_set<size_t>(assert_cast<const 
DataTypeString*>(
-                                                        
remove_nullable(data_type).get())
-                                                        ->len()),
-                               value.template get<TYPE_CHAR>().size());
-        if (target > value.template get<TYPE_CHAR>().size()) {
-            std::string tmp(target, '\0');
-            memcpy(tmp.data(), value.template get<TYPE_CHAR>().data(),
-                   value.template get<TYPE_CHAR>().size());
-            return ComparisonPredicateBase<TYPE_CHAR, PT>::create_shared(
-                    cid, col_name, 
Field::create_field<TYPE_CHAR>(std::move(tmp)), opposite);
-        } else {
-            return ComparisonPredicateBase<TYPE_CHAR, PT>::create_shared(
-                    cid, col_name, 
Field::create_field<TYPE_CHAR>(value.template get<TYPE_CHAR>()),
-                    opposite);
-        }
+        return ComparisonPredicateBase<TYPE_CHAR, PT>::create_shared(
+                cid, col_name, Field::create_field<TYPE_CHAR>(value.template 
get<TYPE_CHAR>()),

Review Comment:
   The CHAR predicate value is now kept unpadded, but the string inverted-index 
path still receives this value directly via 
`ComparisonPredicateBase::evaluate()` and queries it as the index term. CHAR 
values are written to OLAP storage padded to the schema length in 
`OlapColumnDataConvertorChar::convert_to_olap()`, and 
`InvertedIndexColumnWriter::add_values()` indexes that padded slice. For a 
CHAR(5) value `a` with an untokenized inverted index, the indexed term is 
`a\0\0\0\0` while `WHERE c = 'a'` queries `a`, so inverted-index pruning can 
return an empty bitmap and drop matching rows. The IN/NOT IN path has the same 
mismatch after removing CHAR padding from `InListPredicateBase`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [refactor](be) remove CHAR padding on read [doris]

Reply via email to