jubins opened a new pull request, #56625:
URL: https://github.com/apache/spark/pull/56625

   ## What is the purpose of the change
   
   Fixes SPARK-57578 — `UTF8String.codePointFrom`, `trimLeft(UTF8String)`, and 
`trimRight(UTF8String)` perform out-of-bounds native memory reads when the 
string ends in a truncated multi-byte UTF-8 sequence (i.e., a leading byte is 
present but one or more continuation bytes are missing).
   
   The root cause is that all three methods call `numBytesForFirstByte()` to 
determine a character's byte width and then unconditionally read that many 
bytes forward, without checking whether the range stays within `numBytes`. For 
a string whose last byte is a lone 2/3/4-byte leader, this reads past the end 
of the backing buffer via `Platform.getByte(base, offset + n)` — which either 
silently returns adjacent memory or causes a JVM crash depending on the 
allocator and platform.
   
   The companion method `reverse()` was fixed for the exact same class of bug 
in SPARK-57507 (line 1160) using `Math.min(numBytesForFirstByte(getByte(i)), 
numBytes - i)`. That commit explicitly noted `codePointFrom`, `trimLeft`, and 
`trimRight` as unfixed siblings tracked in SPARK-57520. This PR applies the 
same clamping pattern to all three.
   
   ## Brief change log
   
   - `UTF8String.codePointFrom()`: replaced `int numBytes = 
numBytesForFirstByte(b)` with `int numBytes = Math.min(numBytesForFirstByte(b), 
this.numBytes - byteIndex)` so the `case 2/3/4` branches never read 
continuation bytes that do not exist
   - `UTF8String.trimLeft(UTF8String)`: clamped the `copyUTF8String` end 
argument from `searchIdx + numBytesForFirstByte(...) - 1` to `searchIdx + 
Math.min(numBytesForFirstByte(...), numBytes - searchIdx) - 1`
   - `UTF8String.trimRight(UTF8String)`: clamped `stringCharLen[numChars]` from 
`numBytesForFirstByte(getByte(charIdx))` to 
`Math.min(numBytesForFirstByte(getByte(charIdx)), numBytes - charIdx)` so the 
subsequent `copyUTF8String` call cannot extend past the buffer
   - `UTF8StringSuite`: added unit tests for all three methods covering lone 
2-, 3-, and 4-byte leaders as both the source string and the trim-set string
   
   ## Verifying this change
   
   This change is covered by unit tests in `UTF8StringSuite`:
   
   - `trimRightWithTrimString` / `trimLeftWithTrimString`: new assertions 
confirm that passing a truncated single-byte trim-set (lone `0xC3`, `0xE4`, 
`0xF0`) does not crash and returns the input unchanged when no match exists; 
also covers a source string whose trailing byte is a lone multi-byte leader
   - `testCodePointFrom`: new assertions confirm that calling 
`codePointFrom(0)` on a 1-byte string whose only byte is a 2-byte leader, and 
`codePointFrom(1)` on a 2-byte string whose second byte is a 3- or 4-byte 
leader, completes without reading out of bounds
   
   ## Does this pull request potentially affect one of the following parts
   
   - Dependencies (does it add or upgrade a dependency): **no**
   - The public API, i.e., is any changed class annotated with 
`@Public`/`@Evolving`: **no** — `UTF8String` is an internal unsafe type
   - The serializers: **no**
   - The runtime per-record code paths (performance sensitive): **yes** — 
`trimLeft`, `trimRight`, and `codePointFrom` are hot paths, but the fix adds 
only a single `Math.min` call per character iteration, which is negligible
   - Anything that affects deployment or recovery: **no**
   - The S3 file system connector: **no**
   
   ## Documentation
   
   Does this pull request introduce a new feature? **no**
   
   If yes, how is the feature documented? not applicable
   
   ## Was generative AI tooling used to co-author this PR?
   - [x] Yes — Claude Code was used as a pair-programming assistant. All code 
was written, understood, and verified by the author.
   Generated-by: Claude Opus 4.8


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to