sahvx655-wq commented on code in PR #64317:
URL: https://github.com/apache/doris/pull/64317#discussion_r3489652023
##########
be/src/format/parquet/decoder.h:
##########
@@ -152,6 +152,20 @@ class BaseDictDecoder : public Decoder {
return Status::OK();
}
+ // The index bit width is read from the data page and is fully attacker
controlled,
+ // so a decoded index may point past the dictionary. Reject it before it
is used to
+ // look up _dict_items.
+ Status _check_dict_indexes(size_t dict_size) {
Review Comment:
Good catch, you're right that the post-GetBatch check ran too late. I traced
it back through the batch decoder: the width byte goes straight into
RleBatchDecoder<uint32_t>, and on a repeated run NextCounts calls
GetBytes<uint32_t>(BitUtil::Ceil(bit_width, 8), &repeated_value_), where the
num_bytes <= sizeof(T) guard is only a DCHECK. So for any width above 32 a
release build memcpys 5+ bytes into the 4-byte repeated_value_ before my index
check is ever reached, and literal runs wider than 32 come back truncated or
zero filled. The corrupt stream needed rejecting earlier.
I've moved the validation into set_data: it now rejects an empty page and
any width greater than sizeof(uint32_t) * CHAR_BIT before the RLE decoder is
constructed, so a malformed bit width never reaches the batch reader. Added
decoder-level negative tests in both the byte-array and fixed-length dict
decoder tests covering the oversized width, the empty page and the out of range
index path. Pushed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]