Re: RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v5]

Volkan Yazici Thu, 15 Jan 2026 08:34:41 -0800

On Thu, 15 Jan 2026 14:42:34 GMT, Liam Miller-Cushon <[email protected]> wrote:


>> This implements an API to return the byte length of a String encoded in a 
>> given charset. See 
>> [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>> 
>> ---
>> 
>> 
>> Benchmark                              (encoding)  (stringLength)   Mode  
>> Cnt          Score          Error  Units
>> StringLoopJmhBenchmark.getBytes             ASCII              10  thrpt    
>> 5  406782650.595 ± 16960032.852  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII             100  thrpt    
>> 5  172936926.189 ±  4532029.201  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII            1000  thrpt    
>> 5   38830681.232 ±  2413274.766  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII          100000  thrpt    
>> 5     458881.155 ±    12818.317  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1              10  thrpt    
>> 5   37193762.990 ±  3962947.391  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1             100  thrpt    
>> 5   55400876.236 ±  1267331.434  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1            1000  thrpt    
>> 5   11104514.001 ±    41718.545  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1          100000  thrpt    
>> 5     182535.414 ±    10296.120  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16              10  thrpt    
>> 5  113474681.457 ±  8326589.199  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16             100  thrpt    
>> 5   37854103.127 ±  4808526.773  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16            1000  thrpt    
>> 5    4139833.009 ±    70636.784  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16          100000  thrpt    
>> 5      57644.637 ±     1887.112  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII              10  thrpt    
>> 5  946701647.247 ± 76938927.141  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII             100  thrpt    
>> 5  396615374.479 ± 15167234.884  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII            1000  thrpt    
>> 5  100464784.979 ±   794027.897  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII          100000  thrpt    
>> 5    1215487.689 ±     1916.468  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1              10  thrpt    
>> 5  221265102.323 ± 17013983.056  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1             100  thrpt    
>> 5  137617873.887 ±  5842185.781  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1            1000  thrpt    
>> 5   92540259.1...
>
> Liam Miller-Cushon has updated the pull request incrementally with one 
> additional commit since the last revision:
> 
>   Update tests

src/java.base/share/classes/java/lang/String.java line 1097:

> 1095:         return dp;
> 1096:     }
> 1097: 

Verified that this is indeed identical to `encodeASCII` such that only the 
length computation is taken into account.

src/java.base/share/classes/java/lang/String.java line 1512:

> 1510:         return dp;
> 1511:     }
> 1512: 

Verified that this is indeed identical to `encodeUTF8` such that only the 
length computation is taken into account.

src/java.base/share/classes/java/lang/String.java line 1585:

> 1583: 
> 1584:     // This follows the implementation of encodeUTF8_UTF16
> 1585:     private static int encodedLengthUTF8_UTF16(byte[] val) {

Doesn't this duplicate the `computeSizeUTF8_UTF16`?

AFAICS, `computeSizeUTF8_UTF16` is missing the ASCII fast loop, but we can 
enhance it.

FWIW, if we decide reuse `computeSizeUTF8_UTF16`, it might be better to rename 
it to `encodedLengthUTF8_UTF16`, which will be in line with the introduced 
`encodedLength*` method family.

src/java.base/share/classes/java/lang/String.java line 2143:

> 2141:      */
> 2142:     public int getBytesLength(Charset cs) {
> 2143:         if (cs == UTF_8.INSTANCE) {

It'd be nice to catch null values as early as possible. I suggest adding a 
`Objects.requireNonNull(cs)` along with `@throws NullPointerException If {@code 
cs} is null` in docs.

src/java.base/share/classes/java/lang/String.java line 2148:

> 2146:             if (isLatin1()) {
> 2147:                 return value.length;
> 2148:             }

Any particular reason you avoided introducing a `encodedLength8859_1` here? 
(There is a `encode8859_1` method.)

src/java.base/share/classes/java/lang/String.java line 2151:

> 2149:         } else if (cs == US_ASCII.INSTANCE) {
> 2150:             return encodedLengthASCII(coder, value);
> 2151:         } else if (cs instanceof sun.nio.cs.UTF_16LE || cs instanceof 
> sun.nio.cs.UTF_16BE) {

I see that `sun.nio.cs.UTF_16{LE,BE}` specialization is suggested by @ExE-Boss 
[here]. Though I'm not really sure if this is really needed. I cannot spot any 
other usage of these constants in `java.base`, except 
`jdk.internal.foreign.StringSupport`, which is irrelevant.

[here]: https://github.com/openjdk/jdk/pull/28454/files#r2552768341

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695007692
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695024667
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695036430
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695065623
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695076034
PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2695105076

Re: RFR: 8372353: API to compute the byte length of a String encoded in a given Charset [v5]

Reply via email to