Re: [PR] [SPARK-57507][SQL] Fix `UTF8String.reverse` reading past the end of a truncated trailing UTF-8 sequence [spark]

via GitHub Wed, 17 Jun 2026 14:37:35 -0700


cloud-fan commented on code in PR #56569:
URL: https://github.com/apache/spark/pull/56569#discussion_r3431616339



##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1157,7 +1157,7 @@ public UTF8String reverse() {
 
     int i = 0; // position in byte
     while (i < numBytes) {
-      int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
+      int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);

Review Comment:
   The fix here is correct. One note on the rationale: the PR description's 
claim that "every other byte scan in the class already bounds by the remaining 
bytes" isn't quite accurate. `trimLeft` (L1052), `trimRight` (L1124/L1135, via 
`copyUTF8String`), and `codePointFrom` (L736-742) all derive a per-character 
width from `numBytesForFirstByte(...)` and read that many bytes without 
clamping to the remaining bytes — `copyUTF8String` does an unclamped 
`copyMemory` (L944-946) and `getByte` is an unchecked `Platform.getByte`. So on 
the same truncated-trailing-sequence input that motivates this PR, those peers 
over-read identically. Worth softening the description, or filing a follow-up 
for the siblings (cf. @uros-b's note about similar issues across the codebase). 
Not blocking — the `reverse()` change itself is clean and well-tested.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-57507][SQL] Fix `UTF8String.reverse` reading past the end of a truncated trailing UTF-8 sequence [spark]

Reply via email to