LuciferYang opened a new pull request, #56569: URL: https://github.com/apache/spark/pull/56569
### What changes were proposed in this pull request?

`UTF8String.reverse()` reverses a string one UTF-8 character at a time, using `numBytesForFirstByte(getByte(i))` to determine the width of the character at byte position `i`. The per-character copy length was clamped to `numBytes` (the total length) instead of `numBytes - i` (the bytes that actually remain). When the last bytes of the string are a truncated multi-byte sequence (a leader byte whose declared width exceeds the remaining bytes), `copyMemory` read past the end of the string into adjacent memory. This PR changes the clamp to `numBytes - i`.

### Why are the changes needed?

`UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary coercion or truncated input). For such a string ending in an incomplete multi-byte sequence, `reverse()` performed an out-of-bounds read and produced a wrong result. Every other byte scan in the class already bounds by the remaining bytes; this brings `reverse()` in line. Well-formed UTF-8 is unaffected, since a complete sequence never exceeds the remaining bytes.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes incorrect results. The SQL `reverse()` function on a string value that contains malformed UTF-8 ending in a truncated multi-byte sequence no longer reads past the end of the value; only previously-incorrect results change.

### How was this patch tested?

Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, and 4-byte leaders, and for a complete multi-byte character followed by an orphan leader. Each uses a sliced backing array with a trailing sentinel byte so the previous over-read produces a deterministically wrong value; the cases fail on the old code and pass with the fix.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)
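
For illustration, the clamp described in this PR can be sketched over plain byte arrays. This is not Spark's code: `widthForLeader` is a hypothetical stand-in for `numBytesForFirstByte`, `System.arraycopy` stands in for `copyMemory`, and the class name and the `(backing, offset, numBytes)` signature are assumptions made for the example. The sliced backing array with a trailing sentinel mirrors the test setup described above.

```java
import java.util.Arrays;

class ReverseClampSketch {
    // Hypothetical helper: the width a UTF-8 leader byte declares.
    // Continuation bytes and invalid leaders are treated as width 1.
    static int widthForLeader(byte b) {
        int v = b & 0xFF;
        if (v >= 0xC0 && v < 0xE0) return 2;
        if (v >= 0xE0 && v < 0xF0) return 3;
        if (v >= 0xF0 && v < 0xF8) return 4;
        return 1;
    }

    // Reverses the numBytes-long slice of `backing` starting at `offset`,
    // one UTF-8 character at a time.
    static byte[] reverse(byte[] backing, int offset, int numBytes) {
        byte[] result = new byte[numBytes];
        int i = 0;
        while (i < numBytes) {
            // Clamp to the bytes that remain (numBytes - i), not to numBytes.
            // Clamping to numBytes would let a truncated trailing sequence
            // copy bytes that lie beyond the end of the slice.
            int len = Math.min(widthForLeader(backing[offset + i]), numBytes - i);
            System.arraycopy(backing, offset + i, result, numBytes - i - len, len);
            i += len;
        }
        return result;
    }

    public static void main(String[] args) {
        // "ab" followed by a truncated 3-byte sequence (0xE4 0xB8), then a
        // sentinel 'Z' that sits outside the 4-byte slice and must not be copied.
        byte[] backing = {'a', 'b', (byte) 0xE4, (byte) 0xB8, 'Z'};
        byte[] reversed = reverse(backing, 0, 4);
        System.out.println(Arrays.toString(reversed));
    }
}
```

With the clamp in place, the truncated sequence is moved to the front as a 2-byte unit and the sentinel never appears in the result.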
