Re: [PR] bounds-check dictionary indices in parquet dict decoders [doris]

via GitHub Tue, 09 Jun 2026 08:00:28 -0700


github-actions[bot] commented on code in PR #64317:
URL: https://github.com/apache/doris/pull/64317#discussion_r3381670423



##########
be/src/format/parquet/decoder.h:
##########
@@ -152,6 +152,20 @@ class BaseDictDecoder : public Decoder {
         return Status::OK();
     }
 
+    // The index bit width is read from the data page and is fully attacker controlled,
+    // so a decoded index may point past the dictionary. Reject it before it is used to
+    // look up _dict_items.
+    Status _check_dict_indexes(size_t dict_size) {

Review Comment:
   This still leaves the untrusted bit width unchecked before 
`_check_dict_indexes()` runs. `BaseDictDecoder::set_data()` reads the first 
data-page byte and constructs `RleBatchDecoder<uint32_t>` with it; then both 
callers invoke `GetBatch()` before this helper. For a crafted page with 
`bit_width > 32` and a repeated run, `RleBatchDecoder::NextCounts()` calls 
`BatchedBitReader::GetBytes<uint32_t>(BitUtil::Ceil(bit_width, 8), 
&repeated_value_)`. The `num_bytes <= sizeof(T)` guard there is only a 
`DCHECK`, so release builds can `memcpy` 5+ bytes into a 4-byte `uint32_t` 
before the new bounds check is reached. Literal runs with widths above 32 can 
also get truncated or fail with zero-filled `_indexes`, so this does not 
reliably reject the malformed index stream.
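
The byte-count arithmetic behind the overflow can be sketched as follows. This is a minimal illustration, not the Doris code: `ceil_div`, `repeated_value_bytes`, and `read_repeated_value` are hypothetical names, and `ceil_div` assumes `BitUtil::Ceil(value, divisor)` rounds the quotient up. It shows that any width above 32 needs at least 5 bytes, and what a runtime guard (instead of a `DCHECK`) would look like.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for BitUtil::Ceil(value, divisor); assumed to round up.
constexpr std::size_t ceil_div(std::size_t value, std::size_t divisor) {
    return (value + divisor - 1) / divisor;
}

// Bytes occupied by the repeated value of an RLE run at the given bit width.
constexpr std::size_t repeated_value_bytes(std::size_t bit_width) {
    return ceil_div(bit_width, 8);
}

// A 32-bit width fits a uint32_t exactly; 33 bits already needs a fifth byte.
static_assert(repeated_value_bytes(32) == sizeof(std::uint32_t), "fits");
static_assert(repeated_value_bytes(33) > sizeof(std::uint32_t), "overflows");

// Hypothetical checked read: the size guard is a real branch, so it also
// holds in release builds. Returns false and leaves *out untouched when the
// run value would not fit in T.
template <typename T>
bool read_repeated_value(const std::uint8_t* src, std::size_t bit_width, T* out) {
    assert(src != nullptr && out != nullptr);
    const std::size_t num_bytes = repeated_value_bytes(bit_width);
    if (num_bytes > sizeof(T)) {
        return false;  // would copy past the end of *out
    }
    T value = 0;
    std::memcpy(&value, src, num_bytes);
    *out = value;
    return true;
}
```

With `T = uint32_t`, a call with `bit_width = 40` returns `false` without copying, while `bit_width = 32` copies exactly 4 bytes.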
   
   Please validate the dictionary index bit width in 
`BaseDictDecoder::set_data()` before constructing or using the RLE decoder. 
For example, reject empty page data and any width greater than 
`sizeof(uint32_t) * CHAR_BIT`. Please also add decoder-level negative tests 
to the existing byte-array and fixed-length dict decoder tests.
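
The requested validation could look roughly like the sketch below. It is written under stated assumptions rather than against the actual Doris sources: `BitWidthCheck`, `kMaxIndexBitWidth`, and `check_dict_index_bit_width` are hypothetical names, and a real patch would return a `Status` from `set_data()` instead of an enum. The only behavior taken from the comment above is that the first data-page byte is the index bit width.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical result type; the real decoder would return a Status instead.
enum class BitWidthCheck { kOk, kEmptyPage, kWidthTooLarge };

// Widest index the decoder can hold: the bit size of uint32_t.
constexpr std::size_t kMaxIndexBitWidth = sizeof(std::uint32_t) * CHAR_BIT;

// Validates the leading bit-width byte of a dictionary-encoded data page.
// On success, stores the width in *bit_width; otherwise leaves it untouched.
inline BitWidthCheck check_dict_index_bit_width(const std::uint8_t* data,
                                                std::size_t len,
                                                std::uint8_t* bit_width) {
    assert(bit_width != nullptr);
    if (data == nullptr || len == 0) {
        return BitWidthCheck::kEmptyPage;  // no byte to read the width from
    }
    const std::uint8_t width = data[0];
    if (width > kMaxIndexBitWidth) {
        return BitWidthCheck::kWidthTooLarge;  // cannot fit in a uint32_t index
    }
    *bit_width = width;
    return BitWidthCheck::kOk;
}
```

A negative test along the lines requested would feed a page whose first byte is 33 and expect the `kWidthTooLarge` result before any RLE decoding is attempted.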



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

