sahvx655-wq opened a new pull request, #64317:
URL: https://github.com/apache/doris/pull/64317
### What problem does this PR solve?
Problem Summary:
While reading Parquet files through the external file path, I traced how
dictionary-encoded columns are decoded. The index bit width is taken from the
first byte of the data page in `BaseDictDecoder::set_data`, fed straight into
the RLE batch decoder, and the resulting values are used as
`_dict_items[_indexes[...]]` in both `ByteArrayDictDecoder` and
`FixLengthDictDecoder`. Nothing compares those indices against the dictionary
size, so a file whose data page advertises a wider bit width than the
dictionary needs (or simply emits an index past the end) walks off the end of
`_dict_items`. Under ASAN this shows up as a heap-buffer-overflow read inside
the decode loop; without it the reader silently dereferences whatever memory
follows the vector.
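To make the bit-width point concrete, here is a small arithmetic sketch. The helper names `max_encodable_index` and `min_bit_width` are hypothetical and do not exist in Doris; they only illustrate that even the narrowest sufficient bit width can usually encode indices past the end of the dictionary, so the width alone cannot serve as a bound.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Doris code): largest index that a stream with the
// given bit width can encode. Assumes bit_width is in [0, 32].
uint64_t max_encodable_index(uint8_t bit_width) {
    return (uint64_t{1} << bit_width) - 1;
}

// Hypothetical helper (not Doris code): smallest bit width able to address
// every entry of a dictionary with dict_size entries. Capped at 32 bits.
uint8_t min_bit_width(size_t dict_size) {
    uint8_t width = 0;
    while (width < 32 && dict_size > 0 &&
           max_encodable_index(width) < dict_size - 1) {
        ++width;
    }
    return width;
}
```

For a dictionary of 5 entries the minimum width is 3 bits, which can still encode the indices 5, 6 and 7; a data page that advertises a wider width only enlarges that out-of-range region.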
The root cause is the missing range check between the untrusted index stream
and the dictionary it addresses. Both decoders share the same
`_indexes`/`_dict_items` shape, so the guard sits in the common base and is
called from each decoder right after the batch is decoded, before any lookup
happens. Left unfixed, this is an out-of-bounds read reachable by anyone able to
point Doris at a crafted Parquet file (table-valued function, external table,
file load), so closing it at the decoder layer is safer than trusting the
encoder to stay within the dictionary.
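The shape of the guard can be sketched as follows. This is a minimal standalone illustration, not the code in this PR: `first_out_of_range_index` and `lookup_all` are hypothetical names, plain `std::vector` stands in for the decoder members `_indexes` and `_dict_items`, and a `bool` return stands in for whatever error status the real decoders use.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper (not Doris code): returns the position of the first
// decoded index that does not address a dictionary entry, or std::nullopt
// when every index is in range.
std::optional<size_t> first_out_of_range_index(
        const std::vector<uint32_t>& indexes, size_t dict_size) {
    for (size_t i = 0; i < indexes.size(); ++i) {
        if (indexes[i] >= dict_size) {
            return i;
        }
    }
    return std::nullopt;
}

// Stand-in for a decoder's lookup loop: validate the whole batch first,
// then dereference. Returns false on corrupt input without touching out.
bool lookup_all(const std::vector<std::string>& dict_items,
                const std::vector<uint32_t>& indexes,
                std::vector<std::string>* out) {
    if (first_out_of_range_index(indexes, dict_items.size()).has_value()) {
        return false;  // a real reader would report a corruption error here
    }
    for (uint32_t idx : indexes) {
        out->push_back(dict_items[idx]);  // safe: idx < dict_items.size()
    }
    return true;
}
```

Checking the whole batch before the first lookup keeps the lookup loop itself free of branches and means a corrupt page is rejected before any partial output is produced.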
### Release note
None
### Check List (For Author)
- Test
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [x] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [x] Other reason: the change only adds a defensive bounds check
          that rejects corrupt input. Triggering it needs a hand-crafted Parquet file
          carrying out-of-range dictionary indices; the path was validated here by
          annotating the decode flow rather than committing such a fixture.
- Behavior changed:
    - [x] No.
    - [ ] Yes.
- Does this need documentation?
    - [x] No.
    - [ ] Yes.