[
https://issues.apache.org/jira/browse/SPARK-57578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-57578:
-----------------------------------
Labels: pull-request-available (was: )
> [SQL] UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on
> truncated trailing UTF-8 sequences
> ----------------------------------------------------------------------------------------------------------
>
> Key: SPARK-57578
> URL: https://issues.apache.org/jira/browse/SPARK-57578
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Jubin Soni
> Priority: Major
> Labels: pull-request-available
>
> *Problem*
> UTF8String.codePointFrom, trimLeft, and trimRight share the same class of
> out-of-bounds native memory read that was fixed in reverse() by SPARK-57507.
> When a UTF8String ends in a truncated multi-byte sequence — i.e., the string
> contains only the leading byte of a 2/3/4-byte UTF-8 sequence with the
> continuation bytes missing — these methods call numBytesForFirstByte() to get
> the character width and then unconditionally read that many bytes forward
> without checking whether that range is within the string's bounds.
> Affected methods in UTF8String.java:
> {code:java}
> codePointFrom(int byteIndex) [lines 728-746]:
> case 2: reads getByte(byteIndex + 1)
> case 3: reads getByte(byteIndex + 1) and getByte(byteIndex + 2)
> case 4: reads getByte(byteIndex + 1..3)
> No check that byteIndex + n < numBytes.
> trimLeft(UTF8String trimString) [line 1051-1052]:
> copyUTF8String(searchIdx,
> searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1)
> The end index is not clamped to numBytes - 1.
> trimRight(UTF8String trimString) [lines 1122-1135]:
> stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx))
> stored without clamping, then used as:
> copyUTF8String(stringCharPos[numChars-1],
> stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1)
> {code}
> which can extend past the end of the backing buffer.
> The companion method reverse() was fixed in SPARK-57507 (line 1160) using:
> int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
> and that commit explicitly noted these siblings are tracked in SPARK-57520.
> *How to Reproduce*
> The trigger is any SQL expression that calls trim/ltrim/rtrim with a custom
> trim character set on a string whose last UTF-8 character is incomplete, or
> that calls locate/position which internally uses codePointFrom. Such strings
> can arrive from binary-safe sources (BINARY casts, Parquet/ORC raw bytes,
> user-supplied byte arrays via the Java/Scala API).
> Minimal Java reproducer (unit test level):
> {code:java}
> // 0xC3 is the leading byte of a 2-byte UTF-8 sequence
> // but the continuation byte is missing — the string is 1 byte long.
> UTF8String truncated = UTF8String.fromBytes(new byte[]{(byte) 0xC3});
> // trimLeft: passes truncated + ' ' as trim set, triggering the over-read
> UTF8String.fromString(" hello").trimLeft(truncated);
> // trimRight: same
> UTF8String.fromString("hello ").trimRight(truncated);
> // codePointFrom: called directly or via locate()
> truncated.codePointFrom(0); // reads getByte(1) on a 1-byte string
> {code}
>
> At the JVM level this is an off-heap read via Platform.getByte(base,
> offset+n) where n >= numBytes, which either reads adjacent memory silently or
> triggers a JVM crash depending on the allocator and platform.
> *Expected:* Methods should not read beyond numBytes - 1. When the last
> sequence is truncated, the width should be clamped to the remaining bytes
> (same as the already-fixed reverse()).
> *Actual:* Methods read past the end of the backing buffer.
> *Work Needed*
> The fix follows the same pattern as reverse() (SPARK-57507, line 1160).
> 1. codePointFrom: before the switch, clamp numBytes local variable:
> {code:java}
> int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes -
> byteIndex);{code}
> so the case 2/3/4 branches only read continuation bytes that actually
> exist.
>
> 2. trimLeft: change line 1052 from:
> {code:java}
> searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1{code}
> to:
> {code:java}
> searchIdx + Math.min(numBytesForFirstByte(this.getByte(searchIdx)),
> numBytes - searchIdx) - 1{code}
>
> 3. trimRight: change line 1124 from:
> {code:java}
> stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
> to:
> stringCharLen[numChars] =
> Math.min(numBytesForFirstByte(getByte(charIdx)),
> numBytes - charIdx);
> {code}
> Add unit tests in UTF8StringSuite covering trimLeft, trimRight, and
> codePointFrom on a UTF8String backed by a byte array that ends in a truncated
> multi-byte leading byte (e.g. new byte[]\{0x41, (byte)0xC3} — 'A' followed by
> a lone 2-byte leader).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]