Arawoof06 opened a new pull request, #50356:
URL: https://github.com/apache/arrow/pull/50356
### Rationale for this change
`utf8_length_ignore_invalid` extends `char_len` while scanning continuation
bytes and never rechecks the buffer end, so an input ending in a truncated
multi-byte UTF-8 sequence (a `0xF0` lead byte followed by non-continuation
bytes) is read past `data_len`. The function is reachable from untrusted string
data through `lpad`/`rpad`, which count the input glyphs before padding. I
reproduced the over-read against a verbatim copy of the function under
AddressSanitizer (ASAN), using the 4-byte input `{0xF0, 'a', 'a', 'a'}` in an
exactly-sized heap buffer; ASAN reports
`heap-buffer-overflow READ ... 0 bytes after 4-byte region`.
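
The scan described above can be modelled without triggering undefined behaviour. The sketch below is a simplified stand-in reconstructed from this description, not the Gandiva source: `lead_byte_len` and `first_read_past_end` are hypothetical names, and the rule that an invalid or overlong lead byte is treated as a 1-byte glyph is an assumption. Instead of dereferencing an out-of-range index, it returns that index.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the lead-byte classifier: glyph length implied by
// the lead byte, or 0 if the byte cannot start a glyph.
inline int lead_byte_len(uint8_t c) {
  if ((c & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Model of the unguarded scan. Returns the first index >= data.size() that the
// scan would read, or -1 if every read stays inside the buffer.
inline int first_read_past_end(const std::vector<uint8_t>& data) {
  const int data_len = static_cast<int>(data.size());
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = lead_byte_len(data[i]);
    // Assumption: invalid or overlong lead bytes count as a 1-byte glyph.
    if (char_len == 0 || i + char_len > data_len) char_len = 1;
    for (int j = 1; j < char_len; ++j) {
      // The real code would read data[i + j] here without this check.
      if (i + j >= data_len) return i + j;
      // A non-continuation byte extends the glyph, which moves the loop bound.
      if ((data[i + j] & 0xC0) != 0x80) ++char_len;
    }
  }
  return -1;
}
```

In this model, the input `{0xF0, 'a', 'a', 'a'}` yields index 4, the byte immediately after the 4-byte buffer, which is consistent with the ASAN report quoted above.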
### What changes are included in this PR?
Bound the continuation-byte scan with `i + j < data_len`, the same guard the
sibling helpers (`utf8_length`, `reverse_utf8`, `utf8_byte_pos`) already use.
Valid input is unaffected: a well-formed glyph always satisfies
`i + char_len <= data_len`, so the new condition never cuts its scan short.
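
A minimal sketch of the guarded scan follows. It is an illustration under the same assumptions as the description, not the patched Gandiva code: `lead_byte_len` and `count_glyphs_guarded` are hypothetical names, and treating an invalid or overlong lead byte as a 1-byte glyph is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the lead-byte classifier: glyph length implied by
// the lead byte, or 0 if the byte cannot start a glyph.
inline int lead_byte_len(uint8_t c) {
  if ((c & 0x80) == 0x00) return 1;  // 0xxxxxxx
  if ((c & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((c & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((c & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Counts glyphs while tolerating malformed input. The inner loop carries the
// extra `i + j < data_len` bound, so it stops at the buffer end even after
// char_len has been extended.
inline int count_glyphs_guarded(const std::vector<uint8_t>& data) {
  const int data_len = static_cast<int>(data.size());
  int count = 0;
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = lead_byte_len(data[i]);
    // Assumption: invalid or overlong lead bytes count as a 1-byte glyph.
    if (char_len == 0 || i + char_len > data_len) char_len = 1;
    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      // at() throws instead of over-reading if the guard were ever wrong.
      if ((data.at(static_cast<std::size_t>(i + j)) & 0xC0) != 0x80) ++char_len;
    }
    ++count;
  }
  return count;
}
```

In this model, `{0xF0, 'a', 'a', 'a'}` is counted as one glyph with no read past the buffer, and the well-formed input `{0x61, 0xE2, 0x82, 0xAC}` ("a" followed by a 3-byte glyph) is counted as two.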
### Are these changes tested?
Yes. Added `TestStringOps.TestPadMalformedUtf8NoOverread`, which runs
`lpad`/`rpad` on the truncated multi-byte input placed in an exactly-sized heap
buffer so the over-read trips ASAN; the existing pad tests still pass.
### Are there any user-facing changes?
No.
**This PR contains a "Critical Fix".** It fixes an out-of-bounds read in the
Gandiva UTF-8 length helper `utf8_length_ignore_invalid`, reachable from
`lpad`/`rpad` on crafted string data.
* GitHub Issue: #50355