sahvx655-wq commented on PR #64317:
URL: https://github.com/apache/doris/pull/64317#issuecomment-4661030913

   Quick summary for the template above.
   
   The problem is an out-of-bounds read in the Parquet dictionary decoders. I 
was tracing how dictionary-encoded columns get decoded on the external file 
path and noticed that 'BaseDictDecoder::set_data' takes the index bit width 
straight from the first byte of the data page ('uint8_t bit_width = 
*data->data') and feeds it to the RLE batch decoder. The decoded values are 
then used as '_dict_items[_indexes[...]]' in both 
'ByteArrayDictDecoder::_decode_values' and 
'FixLengthDictDecoder::_decode_values', and nothing ever compares them against 
the dictionary size. A data page that advertises a wider bit width than the 
dictionary needs, or simply emits an index past the end, walks off 
'_dict_items'. Under ASAN it shows up as a heap-buffer-overflow read inside the 
decode loop; without ASAN the reader just dereferences whatever memory happens 
to follow the vector.
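
   To make the failure mode concrete: the bit width only bounds what the page 
can encode, not what the dictionary actually holds. A rough standalone sketch, 
with helper names invented for illustration (they are not in the Doris code 
base):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, for illustration only.
// Largest index value that indexes of the given bit width can represent.
inline uint64_t max_encodable_index(uint8_t bit_width) {
    // The Parquet format caps the dictionary index bit width at 32.
    assert(bit_width <= 32);
    return bit_width == 0 ? 0 : (uint64_t{1} << bit_width) - 1;
}

// Hypothetical helper, for illustration only.
// True when a page with this bit width is able to encode at least one
// index that lies past the end of a dictionary with dict_size entries.
inline bool width_allows_out_of_range(uint8_t bit_width, size_t dict_size) {
    return max_encodable_index(bit_width) >= dict_size;
}
```

   For a 3-entry dictionary the minimal width is 2 bits, which can still 
encode index 3, so a page does not even need an inflated bit width to go one 
past the end. The width alone cannot be used as the guard.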
   
   The fix is the single helper 'BaseDictDecoder::_check_dict_indexes', called 
right after 'GetBatch' in both decoders, which validates every decoded index 
against the dictionary size and returns 'Status::Corruption' before any lookup 
happens. Behaviour only changes for malformed input: well-formed files decode 
exactly as before; corrupt ones now get a clean error instead of reading past 
the buffer. Left unfixed, it is reachable by anyone who can point Doris at a 
crafted Parquet file (TVF, external table, file load), so I kept the guard at 
the decoder layer rather than trusting the encoder to stay in range. I 
deliberately kept the diff to just the bounds check, no refactor.
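
   Roughly, the shape of the guard is as follows. This is a standalone sketch, 
not the code in the PR: 'CheckResult' stands in for Doris's 'Status', and the 
function names and signatures are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for a status type (assumption; the real helper returns
// Status::Corruption on failure).
struct CheckResult {
    bool ok;
    std::string message;
};

// Validate every decoded index against the dictionary size before any lookup.
inline CheckResult check_dict_indexes(const std::vector<uint32_t>& indexes,
                                      size_t dict_size) {
    for (uint32_t idx : indexes) {
        if (idx >= dict_size) {
            return {false, "dictionary index " + std::to_string(idx) +
                           " out of range for dictionary of size " +
                           std::to_string(dict_size)};
        }
    }
    return {true, ""};
}

// Decode by dictionary lookup; nothing is read from dict_items unless
// every index has been validated first.
inline CheckResult decode_values(const std::vector<std::string>& dict_items,
                                 const std::vector<uint32_t>& indexes,
                                 std::vector<std::string>* out) {
    assert(out != nullptr);
    CheckResult res = check_dict_indexes(indexes, dict_items.size());
    if (!res.ok) {
        return res;  // corrupt page: report it, append nothing
    }
    for (uint32_t idx : indexes) {
        out->push_back(dict_items[idx]);
    }
    return res;
}
```

   Checking the whole batch up front keeps the lookup loop itself unchanged, 
which is why well-formed input takes the same path as before.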


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

