trimRight out-of-bounds read on truncated trailing UTF-8 sequences

Jubin Soni (Jira) Fri, 19 Jun 2026 17:13:15 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-57578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jubin Soni updated SPARK-57578:
-------------------------------
    Description: 
*Problem*

UTF8String.codePointFrom, trimLeft, and trimRight share the same class of
out-of-bounds native memory read that was fixed in reverse() by SPARK-57507.

When a UTF8String ends in a truncated multi-byte sequence — i.e., the string
contains only the leading byte of a 2/3/4-byte UTF-8 sequence with the
continuation bytes missing — these methods call numBytesForFirstByte() to
get the character width and then unconditionally read that many bytes forward
without checking whether that range is within the string's bounds.

Affected methods in UTF8String.java:

  codePointFrom(int byteIndex) [lines 728-746]:
    case 2: reads getByte(byteIndex + 1)
    case 3: reads getByte(byteIndex + 1) and getByte(byteIndex + 2)
    case 4: reads getByte(byteIndex + 1..3)
    No check that byteIndex + n < numBytes.

  trimLeft(UTF8String trimString) [line 1051-1052]:
    copyUTF8String(searchIdx,
      searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1)
    The end index is not clamped to numBytes - 1.

  trimRight(UTF8String trimString) [lines 1122-1135]:
    stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx))
    stored without clamping, then used as:
    copyUTF8String(stringCharPos[numChars-1],
      stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1)
    which can extend past the end of the backing buffer.

The companion method reverse() was fixed in SPARK-57507 (line 1160) using:
    int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
and that commit explicitly noted these siblings are tracked in SPARK-57520.

*How to Reproduce*

The trigger is any SQL expression that calls trim/ltrim/rtrim with a custom
trim character set on a string whose last UTF-8 character is incomplete, or
that calls locate/position which internally uses codePointFrom. Such strings
can arrive from binary-safe sources (BINARY casts, Parquet/ORC raw bytes,
user-supplied byte arrays via the Java/Scala API).

Minimal Java reproducer (unit test level):

  // 0xC3 is the leading byte of a 2-byte UTF-8 sequence
  // but the continuation byte is missing — the string is 1 byte long.
  UTF8String truncated = UTF8String.fromBytes(new byte[]\{(byte) 0xC3});

  // trimLeft: passes truncated + ' ' as trim set, triggering the over-read
  UTF8String.fromString(" hello").trimLeft(truncated);

  // trimRight: same
  UTF8String.fromString("hello ").trimRight(truncated);

  // codePointFrom: called directly or via locate()
  truncated.codePointFrom(0);  // reads getByte(1) on a 1-byte string

At the JVM level this is an off-heap read via Platform.getByte(base, offset+n)
where n >= numBytes, which either reads adjacent memory silently or triggers a
JVM crash depending on the allocator and platform.

*Expected:* Methods should not read beyond numBytes - 1. When the last sequence
is truncated, the width should be clamped to the remaining bytes
(same as the already-fixed reverse()).

*Actual:* Methods read past the end of the backing buffer.

*Work Needed*

The fix follows the same pattern as reverse() (SPARK-57507, line 1160).

1. codePointFrom: before the switch, clamp numBytes local variable:
     int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - 
byteIndex);
   so the case 2/3/4 branches only read continuation bytes that actually exist.

2. trimLeft: change line 1052 from:
     searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1
   to:
     searchIdx + Math.min(numBytesForFirstByte(this.getByte(searchIdx)),
                          numBytes - searchIdx) - 1

3. trimRight: change line 1124 from:
     stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
   to:
     stringCharLen[numChars] = Math.min(numBytesForFirstByte(getByte(charIdx)),
                                        numBytes - charIdx);

Add unit tests in UTF8StringSuite covering trimLeft, trimRight, and
codePointFrom on a UTF8String backed by a byte array that ends in a truncated
multi-byte leading byte (e.g. new byte[]\{0x41, (byte)0xC3} — 'A' followed by
a lone 2-byte leader).

  was:
*Problem*

UTF8String.codePointFrom, trimLeft, and trimRight share the same class of
out-of-bounds native memory read that was fixed in reverse() by SPARK-57507.

When a UTF8String ends in a truncated multi-byte sequence — i.e., the string
contains only the leading byte of a 2/3/4-byte UTF-8 sequence with the
continuation bytes missing — these methods call numBytesForFirstByte() to
get the character width and then unconditionally read that many bytes forward
without checking whether that range is within the string's bounds.

Affected methods in UTF8String.java:

  codePointFrom(int byteIndex) [lines 728-746]:
    case 2: reads getByte(byteIndex + 1)
    case 3: reads getByte(byteIndex + 1) and getByte(byteIndex + 2)
    case 4: reads getByte(byteIndex + 1..3)
    No check that byteIndex + n < numBytes.

  trimLeft(UTF8String trimString) [line 1051-1052]:
    copyUTF8String(searchIdx,
      searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1)
    The end index is not clamped to numBytes - 1.

  trimRight(UTF8String trimString) [lines 1122-1135]:
    stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx))
    stored without clamping, then used as:
    copyUTF8String(stringCharPos[numChars-1],
      stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1)
    which can extend past the end of the backing buffer.

The companion method reverse() was fixed in SPARK-57507 (line 1160) using:
    int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
and that commit explicitly noted these siblings are tracked in SPARK-57520.


*How to Reproduce*

The trigger is any SQL expression that calls trim/ltrim/rtrim with a custom
trim character set on a string whose last UTF-8 character is incomplete, or
that calls locate/position which internally uses codePointFrom. Such strings
can arrive from binary-safe sources (BINARY casts, Parquet/ORC raw bytes,
user-supplied byte arrays via the Java/Scala API).

Minimal Java reproducer (unit test level):

  // 0xC3 is the leading byte of a 2-byte UTF-8 sequence
  // but the continuation byte is missing — the string is 1 byte long.
  UTF8String truncated = UTF8String.fromBytes(new byte[]\{(byte) 0xC3});

  // trimLeft: passes truncated + ' ' as trim set, triggering the over-read
  UTF8String.fromString(" hello").trimLeft(truncated);

  // trimRight: same
  UTF8String.fromString("hello ").trimRight(truncated);

  // codePointFrom: called directly or via locate()
  truncated.codePointFrom(0);  // reads getByte(1) on a 1-byte string

At the JVM level this is an off-heap read via Platform.getByte(base, offset+n)
where n >= numBytes, which either reads adjacent memory silently or triggers a
JVM crash depending on the allocator and platform.

Expected: Methods should not read beyond numBytes - 1. When the last sequence
is truncated, the width should be clamped to the remaining bytes
(same as the already-fixed reverse()).

Actual: Methods read past the end of the backing buffer.


*Work Needed*

The fix follows the same pattern as reverse() (SPARK-57507, line 1160).

1. codePointFrom: before the switch, clamp numBytes local variable:
     int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - 
byteIndex);
   so the case 2/3/4 branches only read continuation bytes that actually exist.

2. trimLeft: change line 1052 from:
     searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1
   to:
     searchIdx + Math.min(numBytesForFirstByte(this.getByte(searchIdx)),
                          numBytes - searchIdx) - 1

3. trimRight: change line 1124 from:
     stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
   to:
     stringCharLen[numChars] = Math.min(numBytesForFirstByte(getByte(charIdx)),
                                        numBytes - charIdx);

Add unit tests in UTF8StringSuite covering trimLeft, trimRight, and
codePointFrom on a UTF8String backed by a byte array that ends in a truncated
multi-byte leading byte (e.g. new byte[]\{0x41, (byte)0xC3} — 'A' followed by
a lone 2-byte leader).


> [SQL] UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on 
> truncated trailing UTF-8 sequences
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57578
>                 URL: https://issues.apache.org/jira/browse/SPARK-57578
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Jubin Soni
>            Priority: Major
>
> *Problem*
> UTF8String.codePointFrom, trimLeft, and trimRight share the same class of
> out-of-bounds native memory read that was fixed in reverse() by SPARK-57507.
> When a UTF8String ends in a truncated multi-byte sequence — i.e., the string
> contains only the leading byte of a 2/3/4-byte UTF-8 sequence with the
> continuation bytes missing — these methods call numBytesForFirstByte() to
> get the character width and then unconditionally read that many bytes forward
> without checking whether that range is within the string's bounds.
> Affected methods in UTF8String.java:
>   codePointFrom(int byteIndex) [lines 728-746]:
>     case 2: reads getByte(byteIndex + 1)
>     case 3: reads getByte(byteIndex + 1) and getByte(byteIndex + 2)
>     case 4: reads getByte(byteIndex + 1..3)
>     No check that byteIndex + n < numBytes.
>   trimLeft(UTF8String trimString) [line 1051-1052]:
>     copyUTF8String(searchIdx,
>       searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1)
>     The end index is not clamped to numBytes - 1.
>   trimRight(UTF8String trimString) [lines 1122-1135]:
>     stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx))
>     stored without clamping, then used as:
>     copyUTF8String(stringCharPos[numChars-1],
>       stringCharPos[numChars-1] + stringCharLen[numChars-1] - 1)
>     which can extend past the end of the backing buffer.
> The companion method reverse() was fixed in SPARK-57507 (line 1160) using:
>     int len = Math.min(numBytesForFirstByte(getByte(i)), numBytes - i);
> and that commit explicitly noted these siblings are tracked in SPARK-57520.
> *How to Reproduce*
> The trigger is any SQL expression that calls trim/ltrim/rtrim with a custom
> trim character set on a string whose last UTF-8 character is incomplete, or
> that calls locate/position which internally uses codePointFrom. Such strings
> can arrive from binary-safe sources (BINARY casts, Parquet/ORC raw bytes,
> user-supplied byte arrays via the Java/Scala API).
> Minimal Java reproducer (unit test level):
>   // 0xC3 is the leading byte of a 2-byte UTF-8 sequence
>   // but the continuation byte is missing — the string is 1 byte long.
>   UTF8String truncated = UTF8String.fromBytes(new byte[]\{(byte) 0xC3});
>   // trimLeft: passes truncated + ' ' as trim set, triggering the over-read
>   UTF8String.fromString(" hello").trimLeft(truncated);
>   // trimRight: same
>   UTF8String.fromString("hello ").trimRight(truncated);
>   // codePointFrom: called directly or via locate()
>   truncated.codePointFrom(0);  // reads getByte(1) on a 1-byte string
> At the JVM level this is an off-heap read via Platform.getByte(base, offset+n)
> where n >= numBytes, which either reads adjacent memory silently or triggers a
> JVM crash depending on the allocator and platform.
> *Expected:* Methods should not read beyond numBytes - 1. When the last 
> sequence
> is truncated, the width should be clamped to the remaining bytes
> (same as the already-fixed reverse()).
> *Actual:* Methods read past the end of the backing buffer.
> *Work Needed*
> The fix follows the same pattern as reverse() (SPARK-57507, line 1160).
> 1. codePointFrom: before the switch, clamp numBytes local variable:
>      int numBytes = Math.min(numBytesForFirstByte(b), this.numBytes - 
> byteIndex);
>    so the case 2/3/4 branches only read continuation bytes that actually 
> exist.
> 2. trimLeft: change line 1052 from:
>      searchIdx + numBytesForFirstByte(this.getByte(searchIdx)) - 1
>    to:
>      searchIdx + Math.min(numBytesForFirstByte(this.getByte(searchIdx)),
>                           numBytes - searchIdx) - 1
> 3. trimRight: change line 1124 from:
>      stringCharLen[numChars] = numBytesForFirstByte(getByte(charIdx));
>    to:
>      stringCharLen[numChars] = 
> Math.min(numBytesForFirstByte(getByte(charIdx)),
>                                         numBytes - charIdx);
> Add unit tests in UTF8StringSuite covering trimLeft, trimRight, and
> codePointFrom on a UTF8String backed by a byte array that ends in a truncated
> multi-byte leading byte (e.g. new byte[]\{0x41, (byte)0xC3} — 'A' followed by
> a lone 2-byte leader).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-57578) [SQL] UTF8String.codePointFrom/trimLeft/trimRight out-of-bounds read on truncated trailing UTF-8 sequences

Reply via email to