cloud-fan commented on code in PR #56569:
URL: https://github.com/apache/spark/pull/56569#discussion_r3431616339
##########
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java:
##########
@@ -1157,7 +1157,7 @@ public UTF8String reverse() {
int i = 0; // position in byte
while (i < numBytes) {
- int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes);
+ int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
Review Comment:
The fix here is correct. One note on the rationale: the PR description's
claim that "every other byte scan in the class already bounds by the remaining
bytes" isn't quite accurate. `trimLeft` (L1052), `trimRight` (L1124/L1135, via
`copyUTF8String`), and `codePointFrom` (L736-742) all derive a per-character
width from `numBytesForFirstByte(...)` and read that many bytes without
clamping to the remaining bytes — `copyUTF8String` does an unclamped
`copyMemory` (L944-946) and `getByte` is an unchecked `Platform.getByte`. So on
the same truncated-trailing-sequence input that motivates this PR, those peers
over-read identically. Worth softening the description, or filing a follow-up
for the siblings (cf. @uros-b's note about similar issues across the codebase).
Not blocking — the `reverse()` change itself is clean and well-tested.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]