github-actions[bot] commented on code in PR #60374:
URL: https://github.com/apache/doris/pull/60374#discussion_r2893506686


##########
be/src/vec/exec/format/parquet/byte_array_dict_decoder.cpp:
##########
@@ -81,6 +81,13 @@ Status 
ByteArrayDictDecoder::read_dict_values_to_column(MutableColumnPtr& doris_
 MutableColumnPtr ByteArrayDictDecoder::convert_dict_column_to_string_column(
         const ColumnInt32* dict_column) {
     auto res = ColumnString::create();
+    if (_dict_items.empty()) {
+        if (dict_column->size() > 0) {
+            LOG(ERROR) << "Attempt to convert dict column with empty 
dictionary, column size: "

Review Comment:
   **Defensive programming concern (per AGENTS.md):** If `_dict_items.empty()` 
and `dict_column->size() > 0`, this silently returns an empty column (0 rows), 
dropping data. Per the coding standards, this should not be masked with an `if` 
check + LOG:
   
   - If this condition is truly unreachable (because all-null columns should 
never produce non-null dict indices), then use `DORIS_CHECK(dict_column->size() 
== 0)` instead of the `if` + LOG(ERROR).
   - If it IS reachable, then returning 0 rows silently drops data, which is a 
correctness issue.
   
   Suggested:
   ```cpp
   if (_dict_items.empty()) {
       DORIS_CHECK(dict_column->size() == 0) << "Attempt to convert dict column 
with empty dictionary, column size: " << dict_column->size();
       return res;
   }
   ```



##########
be/src/vec/exec/format/parquet/vparquet_column_chunk_reader.cpp:
##########
@@ -417,12 +417,21 @@ Status ColumnChunkReader<IN_COLLECTION, 
OFFSET_INDEX>::_decode_dict_page() {
     if (!dict_loaded) {
         // Load and decompress dictionary page from file
         if (_block_compress_codec != nullptr) {
+            auto dict_num = header->dictionary_page_header.num_values;
+            if (dict_num == 0 && uncompressed_size != 0) {
+                return Status::IOError(
+                        "Dictionary page's num_values is {} but 
uncompressed_size is {}", dict_num,
+                        uncompressed_size);
+            }

Review Comment:
   `DCHECK` disappears in RELEASE builds (known high-risk pattern per coding 
standards). Since this is an invariant assertion (if `dict_num != 0`, 
`compressed_data` must have been populated by `get_page_data`), this should be 
`DORIS_CHECK(!compressed_data.empty())` to provide protection in production.



##########
be/src/vec/exec/format/parquet/fix_length_dict_decoder.hpp:
##########
@@ -81,7 +81,7 @@ class FixLengthDictDecoder final : public BaseDictDecoder {
                           ColumnSelectVector& select_vector, bool 
is_dict_filter) {
         size_t non_null_size = select_vector.num_values() - 
select_vector.num_nulls();
         if (doris_column->is_column_dictionary() &&
-            assert_cast<ColumnDictI32&>(*doris_column).dict_size() == 0) {
+            assert_cast<ColumnDictI32&>(*doris_column).dict_size() == 0 && 
!_dict_items.empty()) {
             std::vector<StringRef> dict_items;

Review Comment:
   Same concern as in 
`ByteArrayDictDecoder::convert_dict_column_to_string_column` — this LOG(ERROR) 
+ return-empty pattern is defensive programming that masks potential data 
corruption. Should use `DORIS_CHECK(dict_column->size() == 0)` instead of the 
`if` + LOG(ERROR) guard, to crash on invariant violation rather than silently 
dropping rows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to