sahvx655-wq commented on PR #64317:
URL: https://github.com/apache/doris/pull/64317#issuecomment-4661030913
Quick summary for the template above.
The problem is an out-of-bounds read in the Parquet dictionary decoders. I
was tracing how dictionary-encoded columns get decoded on the external file
path and noticed that 'BaseDictDecoder::set_data' takes the index bit width
straight from the first byte of the data page ('uint8_t bit_width =
*data->data') and feeds it to the RLE batch decoder. The decoded values are
then used as '_dict_items[_indexes[...]]' in both
'ByteArrayDictDecoder::_decode_values' and
'FixLengthDictDecoder::_decode_values', and nothing ever compares them against
the dictionary size. A data page that advertises a wider bit width than the
dictionary needs, or simply emits an index past the end, walks off
'_dict_items'. Under ASAN it shows up as a heap-buffer-overflow read inside the
decode loop; without ASAN the reader just dereferences whatever memory happens
to follow the vector.
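To make the bit width point concrete, here is a minimal sketch of the arithmetic. The helper names 'max_encodable_index' and 'bit_width_exceeds_dict' are hypothetical and not part of Doris; they only illustrate the range an index stream of a given width can reach.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Doris code): the largest index that a run encoded
// with the given bit width can represent, i.e. 2^bit_width - 1.
constexpr uint64_t max_encodable_index(uint8_t bit_width) {
    return bit_width >= 64 ? UINT64_MAX : (uint64_t{1} << bit_width) - 1;
}

// Hypothetical helper: true when the bit width alone permits an index that
// the dictionary cannot satisfy. An empty dictionary has no valid index.
constexpr bool bit_width_exceeds_dict(uint8_t bit_width, size_t dict_size) {
    return dict_size == 0 ? true
                          : max_encodable_index(bit_width) > dict_size - 1;
}
```

Note that a perfectly valid page can still have a bit width that reaches past the dictionary: a 3-entry dictionary needs 2 bits, and 2 bits can encode index 3. So rejecting pages on bit width alone would not be enough, and the decoded indexes themselves have to be compared against the dictionary size.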
The fix is the single helper 'BaseDictDecoder::_check_dict_indexes', called
right after 'GetBatch' in both decoders, which validates every decoded index
against the dictionary size and returns 'Status::Corruption' before any lookup
happens. Behaviour changes only for malformed input: well-formed files decode
exactly as before, while corrupt ones now get a clean error instead of reading
past the buffer. Left unfixed, the bug is reachable by anyone who can point
Doris at a crafted parquet file (TVF, external table, file load), so I put the
guard at the decoder layer rather than trusting the file's encoder to stay in
range. I deliberately kept the diff to just the bounds check, with no refactor.
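For readers who have not opened the diff, the shape of the check is roughly the following. This is a standalone sketch, not the code in the PR: 'CheckResult' stands in for Doris's status type, and 'check_dict_indexes' and 'decode_values' are simplified, hypothetical analogues of the real members, operating on plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a status type (assumption; Doris has its own Status class).
struct CheckResult {
    bool ok;
    std::string message;
};

// Hypothetical analogue of the helper: reject the batch if any decoded index
// is not a valid position in a dictionary of dict_size entries.
inline CheckResult check_dict_indexes(const std::vector<uint32_t>& indexes,
                                      size_t dict_size) {
    for (uint32_t idx : indexes) {
        if (idx >= dict_size) {
            return {false, "dictionary index " + std::to_string(idx) +
                               " out of range for dictionary of size " +
                               std::to_string(dict_size)};
        }
    }
    return {true, ""};
}

// Simplified decode loop: the whole batch is validated before any lookup,
// so a corrupt page produces an error and no partial output.
inline CheckResult decode_values(const std::vector<std::string>& dict_items,
                                 const std::vector<uint32_t>& indexes,
                                 std::vector<std::string>* out) {
    CheckResult st = check_dict_indexes(indexes, dict_items.size());
    if (!st.ok) {
        return st;
    }
    out->reserve(out->size() + indexes.size());
    for (uint32_t idx : indexes) {
        assert(idx < dict_items.size());  // guaranteed by the check above
        out->push_back(dict_items[idx]);
    }
    return st;
}
```

Validating the batch up front costs one extra pass over the indexes, but it keeps the lookup loop itself unchanged for well-formed input.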
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]