[ 
https://issues.apache.org/jira/browse/SPARK-57578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57578:
-----------------------------------
    Labels: pull-request-available  (was: )

> [SQL] UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on 
> truncated trailing UTF-8 sequences
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57578
>                 URL: https://issues.apache.org/jira/browse/SPARK-57578
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Jubin Soni
>            Priority: Major
>              Labels: pull-request-available
>
> *Problem*
> UTF8String.codePointFrom, trimLeft, and trimRight share the same class of 
> out-of-bounds native memory read that was fixed in reverse() by SPARK-57507.
> When a UTF8String ends in a truncated multi-byte sequence — i.e., the string 
> contains only the leading byte of a 2/3/4-byte UTF-8 sequence with the
> continuation bytes missing — these methods call numBytesForFirstByte() to get 
> the character width and then unconditionally read that many bytes forward 
> without checking whether that range is within the string's bounds.
> Affected methods in UTF8String.java:
> {code:java}
> codePointFrom(int byteIndex) [lines 728-746]:
>     case 2: reads getByte(byteIndex + 1)
>     case 3: reads getByte(byteIndex + 1) and getByte(byteIndex + 2)
>     case 4: reads getByte(byteIndex + 1..3)
>     No check that byteIndex + n < numBytes.
>   trimLeft(UTF8String trimString) [line 1051-1052]:
>     copyUTF8String(searchIdx,
>       searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1)
>     The end index is not clamped to numBytes - 1.
>   trimRight(UTF8String trimString) [lines 1122-1135]:
>     stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx))
>     stored without clamping, then used as:
>     copyUTF8String(stringCharPos[numChars-1],
>       stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1)
> {code}
>     which can extend past the end of the backing buffer.
> The companion method reverse() was fixed in SPARK-57507 (line 1160) using:
>     int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
> and that commit explicitly noted these siblings are tracked in SPARK-57520.
> *How to Reproduce*
> The trigger is any SQL expression that calls trim/ltrim/rtrim with a custom 
> trim character set on a string whose last UTF-8 character is incomplete, or
> that calls locate/position which internally uses codePointFrom. Such strings 
> can arrive from binary-safe sources (BINARY casts, Parquet/ORC raw bytes, 
> user-supplied byte arrays via the Java/Scala API).
> Minimal Java reproducer (unit test level):
> {code:java}
> // 0xC3 is the leading byte of a 2-byte UTF-8 sequence
>   // but the continuation byte is missing — the string is 1 byte long.
>   UTF8String truncated = UTF8String.fromBytes(new byte[]{(byte) 0xC3});
>   // trimLeft: passes truncated + ' ' as trim set, triggering the over-read
>   UTF8String.fromString(" hello").trimLeft(truncated);
>   // trimRight: same
>   UTF8String.fromString("hello ").trimRight(truncated);
>   // codePointFrom: called directly or via locate()
>   truncated.codePointFrom(0);  // reads getByte(1) on a 1-byte string
> {code}
>  
> At the JVM level this is an off-heap read via Platform.getByte(base, 
> offset+n) where n >= numBytes, which either reads adjacent memory silently or 
> triggers a JVM crash depending on the allocator and platform.
> *Expected:* Methods should not read beyond numBytes - 1. When the last 
> sequence is truncated, the width should be clamped to the remaining bytes 
> (same as the already-fixed reverse()).
> *Actual:* Methods read past the end of the backing buffer.
> *Work Needed*
> The fix follows the same pattern as reverse() (SPARK-57507, line 1160).
> 1. codePointFrom: before the switch, clamp numBytes local variable:
> {code:java}
>      int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - 
> byteIndex);{code}
>    so the case 2/3/4 branches only read continuation bytes that actually 
> exist.
>  
> 2. trimLeft: change line 1052 from:
> {code:java}
>      searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1{code}
>    to:
> {code:java}
>      searchIdx + Math.min(numBytesForFirstByte(this.getByte(searchIdx)),
>                           numBytes - searchIdx) - 1{code}
>  
> 3. trimRight: change line 1124 from:
> {code:java}
>    stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
>    to:
>      stringCharLen[numChars] = 
> Math.min(numBytesForFirstByte(getByte(charIdx)),
>                                         numBytes - charIdx);
> {code}
> Add unit tests in UTF8StringSuite covering trimLeft, trimRight, and 
> codePointFrom on a UTF8String backed by a byte array that ends in a truncated
> multi-byte leading byte (e.g. new byte[]\{0x41, (byte)0xC3} — 'A' followed by 
> a lone 2-byte leader).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to