[PR] GH-50355: [C++][Gandiva] fix out-of-bounds read in utf8_length_ignore_invalid [arrow]

via GitHub Fri, 03 Jul 2026 02:13:16 -0700


Arawoof06 opened a new pull request, #50356:
URL: https://github.com/apache/arrow/pull/50356


   ### Rationale for this change
   
   `utf8_length_ignore_invalid` extends `char_len` while scanning continuation 
bytes but never rechecks the buffer end, so an input ending in a truncated 
multi-byte UTF-8 sequence (a `0xF0` lead byte followed by non-continuation 
bytes) causes a read past `data_len`. The function is reachable from untrusted 
string data through `lpad`/`rpad`, which count the input glyphs before padding. 
The over-read was reproduced against a verbatim copy of the function under 
AddressSanitizer with the 4-byte input `{0xF0, 'a', 'a', 'a'}` in an 
exactly-sized heap buffer, giving 
`heap-buffer-overflow READ ... 0 bytes after 4-byte region`.
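
To make the failure mode concrete, here is a minimal model of the scan described above. It is a sketch, not a copy of the Gandiva source: the loop shape is inferred from this description, and `lead_byte_length` and `max_offset_touched_unguarded` are hypothetical names introduced only for illustration. Instead of reading out of bounds, the model records the highest offset the unguarded scan would dereference.

```cpp
#include <cassert>

// Hypothetical stand-in for a lead-byte classifier: returns the glyph length
// implied by the first byte, or 0 for a byte that cannot start a glyph.
static int lead_byte_length(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Models an unguarded continuation-byte scan (assumed shape) and returns the
// highest offset it would dereference. Offsets at or beyond data_len are only
// recorded, never read, so the model itself stays in bounds.
static int max_offset_touched_unguarded(const unsigned char* data, int data_len) {
  assert(data_len >= 0);
  int max_offset = -1;
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    if (i > max_offset) max_offset = i;
    char_len = lead_byte_length(data[i]);
    if (char_len == 0 || i + char_len > data_len) char_len = 1;
    for (int j = 1; j < char_len; ++j) {
      const int off = i + j;
      if (off > max_offset) max_offset = off;
      if (off >= data_len) break;  // an unguarded scan would read here
      // Each non-continuation byte widens the glyph, pushing the scan further.
      if ((data[off] & 0xC0) != 0x80) char_len += 1;
    }
  }
  return max_offset;
}
```

For `{0xF0, 'a', 'a', 'a'}` with `data_len == 4`, the model returns offset 4, the first byte past the buffer, which is consistent with the `0 bytes after 4-byte region` report.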
   
   ### What changes are included in this PR?
   
   Bound the continuation-byte scan with `i + j < data_len`, the same guard the 
sibling helpers (`utf8_length`, `reverse_utf8`, `utf8_byte_pos`) already use. 
Valid input is unaffected because a well-formed glyph always satisfies `i + 
char_len <= data_len`.
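
The guard can be sketched as follows. This is an illustrative, self-contained model under the same assumptions as the description above, not the actual patch: `lead_byte_length` and `count_glyphs_bounded` are hypothetical names, and only the `i + j < data_len` condition is taken from this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for a lead-byte classifier: returns the glyph length
// implied by the first byte, or 0 for a byte that cannot start a glyph.
static int lead_byte_length(unsigned char c) {
  if (c < 0x80) return 1;
  if ((c & 0xE0) == 0xC0) return 2;
  if ((c & 0xF0) == 0xE0) return 3;
  if ((c & 0xF8) == 0xF0) return 4;
  return 0;
}

// Counts glyphs while tolerating invalid bytes. The inner loop carries the
// extra bound `i + j < data_len`, so it never reads past the buffer even when
// non-continuation bytes keep extending char_len.
static int count_glyphs_bounded(const unsigned char* data, int data_len) {
  assert(data_len >= 0);
  int count = 0;
  int char_len = 0;
  for (int i = 0; i < data_len; i += char_len) {
    char_len = lead_byte_length(data[i]);
    // Invalid lead byte or incomplete glyph: treat it as a single byte.
    if (char_len == 0 || i + char_len > data_len) char_len = 1;
    for (int j = 1; j < char_len && i + j < data_len; ++j) {
      if ((data[i + j] & 0xC0) != 0x80) char_len += 1;
    }
    ++count;
  }
  return count;
}
```

In this sketch the malformed input `{0xF0, 'a', 'a', 'a'}` is counted as one glyph without touching offset 4, while a well-formed input such as `{0x61, 0xE2, 0x82, 0xAC}` still yields two glyphs, since the added bound is never the limiting condition for a valid glyph.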
   
   ### Are these changes tested?
   
   Yes. Added `TestStringOps.TestPadMalformedUtf8NoOverread`, which runs 
`lpad`/`rpad` on the truncated multi-byte input placed in an exactly-sized heap 
buffer so the over-read trips ASAN; the existing pad tests still pass.
   
   ### Are there any user-facing changes?
   
   No.
   
   **This PR contains a "Critical Fix".** It fixes an out-of-bounds read in the 
Gandiva UTF-8 length helper, which is reachable from `lpad`/`rpad` on crafted 
string data.
   
   * GitHub Issue: #50355
   


