jubins opened a new pull request, #56625: URL: https://github.com/apache/spark/pull/56625

## What is the purpose of the change

Fixes SPARK-57578: `UTF8String.codePointFrom`, `trimLeft(UTF8String)`, and `trimRight(UTF8String)` perform out-of-bounds native memory reads when the string ends in a truncated multi-byte UTF-8 sequence (i.e., a leading byte is present but one or more continuation bytes are missing).

The root cause is that all three methods call `numBytesForFirstByte()` to determine a character's byte width and then unconditionally read that many bytes forward, without checking whether the range stays within `numBytes`. For a string whose last byte is a lone 2/3/4-byte leader, this reads past the end of the backing buffer via `Platform.getByte(base, offset + n)`, which either silently returns adjacent memory or causes a JVM crash depending on the allocator and platform.

The companion method `reverse()` was fixed for the exact same class of bug in SPARK-57507 (line 1160) using `Math.min(numBytesForFirstByte(getByte(i)), numBytes - i)`. That commit explicitly noted `codePointFrom`, `trimLeft`, and `trimRight` as unfixed siblings tracked in SPARK-57520. This PR applies the same clamping pattern to all three.

## Brief change log

- `UTF8String.codePointFrom()`: replaced `int numBytes = numBytesForFirstByte(b)` with `int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - byteIndex)` so the `case 2/3/4` branches never read continuation bytes that do not exist
- `UTF8String.trimLeft(UTF8String)`: clamped the `copyUTF8String` end argument from `searchIdx + numBytesForFirstByte(...) - 1` to `searchIdx + Math.min(numBytesForFirstByte(...), numBytes - searchIdx) - 1`
- `UTF8String.trimRight(UTF8String)`: clamped `stringCharLen[numChars]` from `numBytesForFirstByte(getByte(charIdx))` to `Math.min(numBytesForFirstByte(getByte(charIdx)), numBytes - charIdx)` so the subsequent `copyUTF8String` call cannot extend past the buffer
- `UTF8StringSuite`: added unit tests for all three methods covering lone 2-, 3-, and 4-byte leaders as both the source string and the trim-set string

## Verifying this change

This change is covered by unit tests in `UTF8StringSuite`:

- `trimRightWithTrimString` / `trimLeftWithTrimString`: new assertions confirm that passing a truncated single-byte trim-set (lone `0xC3`, `0xE4`, `0xF0`) does not crash and returns the input unchanged when no match exists; also covers a source string whose trailing byte is a lone multi-byte leader
- `testCodePointFrom`: new assertions confirm that calling `codePointFrom(0)` on a 1-byte string whose only byte is a 2-byte leader, and `codePointFrom(1)` on a 2-byte string whose second byte is a 3- or 4-byte leader, completes without reading out of bounds

## Does this pull request potentially affect one of the following parts

- Dependencies (does it add or upgrade a dependency): **no**
- The public API, i.e., is any changed class annotated with `@Public`/`@Evolving`: **no**; `UTF8String` is an internal unsafe type
- The serializers: **no**
- The runtime per-record code paths (performance sensitive): **yes**; `trimLeft`, `trimRight`, and `codePointFrom` are hot paths, but the fix adds only a single `Math.min` call per character iteration, which is negligible
- Anything that affects deployment or recovery: **no**
- The S3 file system connector: **no**

## Documentation

- Does this pull request introduce a new feature? **no**
- If yes, how is the feature documented? not applicable

## Was generative AI tooling used to co-author this PR?

- [x] Yes. Claude Code was used as a pair-programming assistant.
All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8
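
For reviewers who want to see the clamping pattern from the change log in isolation, here is a minimal sketch over a plain `byte[]` rather than the off-heap buffer that `UTF8String` uses. The class `ClampedWidthSketch` and the helpers `widthForFirstByte` and `clampedWidth` are hypothetical names chosen for illustration; `widthForFirstByte` is only an approximate stand-in for `numBytesForFirstByte` and is not the Spark implementation.

```java
// Illustrative sketch only: hypothetical names, plain byte[] instead of Platform memory.
final class ClampedWidthSketch {

    // Approximate stand-in for numBytesForFirstByte: the width implied by a UTF-8 leading byte.
    // Assumption: continuation bytes and invalid leaders are treated as width 1.
    static int widthForFirstByte(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) return 1;               // ASCII
        if (v >= 0xC2 && v <= 0xDF) return 2; // 2-byte leader
        if (v >= 0xE0 && v <= 0xEF) return 3; // 3-byte leader
        if (v >= 0xF0 && v <= 0xF4) return 4; // 4-byte leader
        return 1;
    }

    // The clamp described in the PR: never report more bytes than remain in the buffer.
    static int clampedWidth(byte[] bytes, int index) {
        return Math.min(widthForFirstByte(bytes[index]), bytes.length - index);
    }

    public static void main(String[] args) {
        // 'a' followed by a lone 3-byte leader with no continuation bytes.
        byte[] truncated = {(byte) 0x61, (byte) 0xE4};
        System.out.println(widthForFirstByte(truncated[1])); // 3: would read 2 bytes past the end
        System.out.println(clampedWidth(truncated, 1));      // 1: stays inside the buffer
    }
}
```

On well-formed input the clamp is a no-op, since the declared width never exceeds the remaining byte count; it only changes behavior for a truncated trailing sequence.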
