On Sat, 22 Nov 2025 09:37:31 GMT, ExE Boss <[email protected]> wrote:

>> This implements an API to return the byte length of a String encoded in a
>> given charset. See
>> [JDK-8372353](https://bugs.openjdk.org/browse/JDK-8372353) for background.
>>
>> ---
>>
>>
>> Benchmark                              (encoding)  (stringLength)   Mode  Cnt          Score          Error  Units
>> StringLoopJmhBenchmark.getBytes             ASCII              10  thrpt    5  406782650.595 ± 16960032.852  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII             100  thrpt    5  172936926.189 ±  4532029.201  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII            1000  thrpt    5   38830681.232 ±  2413274.766  ops/s
>> StringLoopJmhBenchmark.getBytes             ASCII          100000  thrpt    5     458881.155 ±    12818.317  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1              10  thrpt    5   37193762.990 ±  3962947.391  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1             100  thrpt    5   55400876.236 ±  1267331.434  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1            1000  thrpt    5   11104514.001 ±    41718.545  ops/s
>> StringLoopJmhBenchmark.getBytes            LATIN1          100000  thrpt    5     182535.414 ±    10296.120  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16              10  thrpt    5  113474681.457 ±  8326589.199  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16             100  thrpt    5   37854103.127 ±  4808526.773  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16            1000  thrpt    5    4139833.009 ±    70636.784  ops/s
>> StringLoopJmhBenchmark.getBytes             UTF16          100000  thrpt    5      57644.637 ±     1887.112  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII              10  thrpt    5  946701647.247 ± 76938927.141  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII             100  thrpt    5  396615374.479 ± 15167234.884  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII            1000  thrpt    5  100464784.979 ±   794027.897  ops/s
>> StringLoopJmhBenchmark.getBytesLength       ASCII          100000  thrpt    5    1215487.689 ±     1916.468  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1              10  thrpt    5  221265102.323 ± 17013983.056  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1             100  thrpt    5  137617873.887 ±  5842185.781  ops/s
>> StringLoopJmhBenchmark.getBytesLength      LATIN1            1000  thrpt    5   92540259.1...
>
> src/java.base/share/classes/java/lang/String.java line 2127:
>
>> 2125: * equivalent to this string, {@code false} otherwise
>> 2126: *
>> 2127: * @see #compareTo(String)
>
> For the **BOM**‑less **UTF‑16** charsets, this can simply return
> `value.length << (1 - coder())`[^1]:
>
> Suggestion:
>
>     if (cs instanceof sun.nio.cs.UTF_16LE ||
>         cs instanceof sun.nio.cs.UTF_16BE) {
>         return value.length << (1 - coder());
>     }
>     return getBytes(cs).length;
>
>
> [^1]: Lone surrogates get replaced with `U+FFFD` when encoding to **UTF‑16**
> by [`String::getBytes(Charset)`], and all of **LATIN1** can be encoded in
> **UTF‑16**.
>
> [`String::getBytes(Charset)`]:
> https://docs.oracle.com/en/java/javase/25/docs/api/java.base/java/lang/String.html#getBytes(java.nio.charset.Charset)

Thanks! There is more work that could be done for other charsets here. I focused on UTF-8 and the bytesCompatible case as a proof of concept, and as a way to start discussing this. It may or may not make sense to have optimized paths for all of the other standard charsets.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28454#discussion_r2556171650
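
For anyone who wants to sanity-check the UTF-16 shortcut quoted above from outside `java.lang`, here is a minimal sketch. It assumes that `value.length << (1 - coder())` works out to twice `String.length()` (one byte per char in `value` for LATIN1, two for UTF16), and it uses only public API because `value` and `coder()` are not accessible to user code. The class name `Utf16ByteLength` and the method `encodedLength` are hypothetical, not part of the PR.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the PR's implementation.
public class Utf16ByteLength {

    // Returns the encoded byte length of s in cs, with an assumed fast path
    // for the BOM-less UTF-16 charsets: every char becomes exactly two bytes.
    static int encodedLength(String s, Charset cs) {
        if (cs.equals(StandardCharsets.UTF_16LE)
                || cs.equals(StandardCharsets.UTF_16BE)) {
            // Public-API equivalent of value.length << (1 - coder()).
            // multiplyExact throws instead of silently overflowing.
            return Math.multiplyExact(s.length(), 2);
        }
        // Fallback: encode and measure, as in the quoted suggestion.
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "",              // empty
            "abc",           // ASCII only
            "caf\u00E9",     // LATIN1 range
            "\uD83D\uDE00",  // surrogate pair (one supplementary code point)
            "a\uD800b"       // lone surrogate, replaced on encoding
        };
        Charset[] charsets = {
            StandardCharsets.UTF_16LE, StandardCharsets.UTF_16BE
        };
        for (Charset cs : charsets) {
            for (String s : samples) {
                int fast = encodedLength(s, cs);
                int slow = s.getBytes(cs).length;
                System.out.println(cs + " chars=" + s.length()
                        + " fast=" + fast + " slow=" + slow);
            }
        }
    }
}
```

The lone-surrogate sample is the interesting one: the shortcut only holds because the replacement for an unencodable char is itself two bytes long. `StandardCharsets.UTF_16` (with BOM) is deliberately left on the slow path, since the byte order mark adds to the length.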
