LuciferYang opened a new pull request, #56569:
URL: https://github.com/apache/spark/pull/56569

   ### What changes were proposed in this pull request?
   
   `UTF8String.reverse()` reverses a string one UTF-8 character at a time, 
using `numBytesForFirstByte(getByte(i))` to determine the width of the 
character at byte position `i`. The per-character copy length was clamped to 
`numBytes` (the total length) instead of `numBytes - i` (the bytes that 
actually remain). When the last bytes of the string are a truncated multi-byte 
sequence (a leader byte whose declared width exceeds the remaining bytes), 
`copyMemory` read past the end of the string into adjacent memory. This PR 
changes the clamp to `numBytes - i`.
   
   ### Why are the changes needed?
   
   `UTF8String` can hold malformed UTF-8 (for example, bytes produced by binary 
coercion or truncated input). For such a string ending in an incomplete 
multi-byte sequence, `reverse()` performed an out-of-bounds read and produced a 
wrong result. Every other byte scan in the class already bounds its reads by 
the remaining bytes; this change brings `reverse()` in line. Well-formed UTF-8 is 
unaffected, since a complete sequence never exceeds the remaining bytes.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it fixes incorrect results. The SQL `reverse()` function on a string 
value that contains malformed UTF-8 ending in a truncated multi-byte sequence 
no longer reads past the end of the value; only previously-incorrect results 
change.
   
   ### How was this patch tested?
   
   Added cases to `UTF8StringSuite#reverse()` for truncated trailing 2-, 3-, 
and 4-byte leaders, and for a complete multi-byte character followed by an 
orphan leader. Each uses a sliced backing array with a trailing sentinel byte 
so the previous over-read produces a deterministically wrong value; the cases 
fail on the old code and pass with the fix.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

