sahvx655-wq commented on code in PR #64317:
URL: https://github.com/apache/doris/pull/64317#discussion_r3489652023


##########
be/src/format/parquet/decoder.h:
##########
@@ -152,6 +152,20 @@ class BaseDictDecoder : public Decoder {
         return Status::OK();
     }
 
+    // The index bit width is read from the data page and is fully attacker 
controlled,
+    // so a decoded index may point past the dictionary. Reject it before it 
is used to
+    // look up _dict_items.
+    Status _check_dict_indexes(size_t dict_size) {

Review Comment:
   Good catch, you're right that the post-GetBatch check ran too late. I traced 
it back through the batch decoder: the width byte goes straight into 
RleBatchDecoder<uint32_t>, and on a repeated run NextCounts calls 
GetBytes<uint32_t>(BitUtil::Ceil(bit_width, 8), &repeated_value_), where the 
num_bytes <= sizeof(T) guard is only a DCHECK. So for any width above 32 a 
release build memcpys 5+ bytes into the 4-byte repeated_value_ before my index 
check is ever reached, and literal runs wider than 32 come back truncated or 
zero filled. The corrupt stream needed rejecting earlier.
   
   I've moved the validation into set_data: it now rejects an empty page and 
any width greater than sizeof(uint32_t) * CHAR_BIT before the RLE decoder is 
constructed, so a malformed bit width never reaches the batch reader. Added 
decoder-level negative tests in both the byte-array and fixed-length dict 
decoder tests covering the oversized width, the empty page and the out of range 
index path. Pushed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to