sahvx655-wq opened a new pull request, #64317:
URL: https://github.com/apache/doris/pull/64317

   ### What problem does this PR solve?
   
   Problem Summary:
   
   While reading Parquet files through the external file path, I traced how 
dictionary-encoded columns are decoded. The index bit width is taken from the 
first byte of the data page in `BaseDictDecoder::set_data`, fed straight into 
the RLE batch decoder, and the resulting values are used as 
`_dict_items[_indexes[...]]` in both `ByteArrayDictDecoder` and 
`FixLengthDictDecoder`. Nothing compares those indices against the dictionary 
size, so a file whose data page advertises a wider bit width than the 
dictionary needs (or simply emits an index past the end) reads past the end of 
`_dict_items`. Under ASAN this shows up as a heap-buffer-overflow read inside 
the decode loop; without ASAN the reader silently reads whatever memory 
follows the vector's storage.
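
To make the bit-width point concrete, here is a minimal sketch (not Doris code; both function names are hypothetical) of why the advertised bit width cannot be trusted to keep indices inside the dictionary. Even the smallest width that fits a dictionary can usually encode values past its end, for example 3 bits for a 5-entry dictionary still encodes 5, 6 and 7.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: largest index that a stream packed with the given
// bit width can encode.
constexpr uint64_t max_encodable_index(unsigned bit_width) {
    return bit_width == 0    ? 0
           : bit_width >= 64 ? UINT64_MAX
                             : (uint64_t{1} << bit_width) - 1;
}

// Hypothetical helper: true when the bit width alone permits an index at or
// past dict_size, i.e. an out-of-range lookup is representable in the stream.
constexpr bool bit_width_allows_overrun(unsigned bit_width, size_t dict_size) {
    return max_encodable_index(bit_width) >= dict_size;
}
```

Only dictionaries whose size is an exact power of two are fully covered by a matching bit width, which is why checking the width alone would not be enough and each decoded index has to be compared against the dictionary size.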
   
   The root cause is the missing range check between the untrusted index stream 
and the dictionary it addresses. Both decoders share the same 
`_indexes`/`_dict_items` shape, so the guard sits in the common base and is 
called from each decoder right after the batch is decoded, before any lookup 
happens. Left unfixed, this is an out-of-bounds read reachable by anyone able 
to point Doris at a crafted Parquet file (table-valued function, external 
table, file load), so closing it at the decoder layer is safer than trusting 
the encoder to stay within the dictionary.
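
The validate-then-lookup ordering described above can be sketched as follows. This is an illustrative stand-alone version, not the actual patch: `indexes_in_range`, `lookup_values`, the use of `std::string` entries, and the error text are all assumptions made for the example, and the real decoders operate on Doris column types and return Doris status values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical shared guard: true only when every decoded index addresses a
// valid slot in a dictionary holding dict_size entries.
inline bool indexes_in_range(const std::vector<uint32_t>& indexes, size_t dict_size) {
    for (uint32_t idx : indexes) {
        if (idx >= dict_size) {
            return false;
        }
    }
    return true;
}

// Hypothetical decode step: run the guard on the whole batch first, and only
// then perform dictionary lookups. On failure nothing is appended to *out.
inline bool lookup_values(const std::vector<std::string>& dict_items,
                          const std::vector<uint32_t>& indexes,
                          std::vector<std::string>* out, std::string* error) {
    assert(out != nullptr && error != nullptr);
    if (!indexes_in_range(indexes, dict_items.size())) {
        *error = "dictionary index out of range";
        return false;
    }
    out->reserve(out->size() + indexes.size());
    for (uint32_t idx : indexes) {
        out->push_back(dict_items[idx]);  // safe: idx < dict_items.size()
    }
    return true;
}
```

Checking the whole batch before the first lookup keeps the hot lookup loop free of branches and means a corrupt page is rejected without emitting partial output.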
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [x] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [x] Other reason: the change only adds a defensive bounds check 
that rejects corrupt input. Triggering it needs a hand-crafted Parquet file 
carrying out-of-range dictionary indices; the path was validated here by 
annotating the decode flow rather than committing such a fixture.
   
   - Behavior changed:
       - [x] No.
       - [ ] Yes.
   
   - Does this need documentation?
       - [x] No.
       - [ ] Yes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
