Hi Johannes,

I think the third scenario you mentioned is likely: Swedish and other languages extend ASCII with diacritics, so non-ASCII bytes frequently interrupt runs of ASCII. And for non-ASCII-heavy languages like Chinese, the text can still include spaces or ASCII digits; invoking the intrinsic for those isolated bytes sounds a bit unwise too.
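If we did want to batch that aggressively, one conceivable compromise might be to only pay for the intrinsic call when a cheap probe suggests there is an actual ASCII run ahead. This is just a sketch reusing the variable names from your snippet below; the probe distance of 8 bytes is made up and would need benchmarking:

    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        int limit = Math.min(sl - sp, dl - dp);
        if (limit >= 8 && sa[sp + 7] >= 0) {
            // probe a few bytes ahead before paying for the intrinsic call (hypothetical heuristic)
            int n = JLA.decodeASCII(sa, sp, da, dp, limit);
            sp += n;
            dp += n;
        } else {
            // isolated ASCII byte (e.g. a space or digit in CJK text): copy it directly
            da[dp++] = (char) b1;
            sp++;
        }
    }

For CJK text with isolated spaces or digits the probe would usually fail and the byte would be copied as before, while longer ASCII runs would still reach the intrinsic.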
Regards,
Chen Liang

________________________________
From: core-libs-dev <core-libs-dev-r...@openjdk.org> on behalf of Johannes Döbler <j...@civilian-framework.org>
Sent: Monday, May 12, 2025 6:16 AM
To: core-libs-dev@openjdk.org <core-libs-dev@openjdk.org>
Subject: potential performance improvement in sun.nio.cs.UTF_8

I have a suggestion for a performance improvement in sun.nio.cs.UTF_8, the workhorse for stream-based UTF-8 encoding and decoding, but I don't know if this has been discussed before. I'll explain my idea for the decoding case.

Claes Redestad describes in his blog https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he used SIMD intrinsics (now JavaLangAccess.decodeASCII) to speed up UTF_8 decoding when buffers are backed by arrays:
https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231

* First a call to JLA.decodeASCII harvests all ASCII characters (= 1-byte UTF-8 sequences) at the beginning of the input.
* Then the decoder enters the slow loop of looking at UTF-8 byte sequences in the input buffer and writing to the output buffer (this is basically the old implementation).

If the input is all ASCII, all decoding work is done in JLA.decodeASCII, resulting in an extreme performance boost. But once the input contains a non-ASCII byte, it falls back to the slow array loop for the rest of the buffer.

Now here is my idea: why not call JLA.decodeASCII whenever an ASCII byte is seen?

    while (sp < sl) {
        int b1 = sa[sp];
        if (b1 >= 0) {
            // 1 byte, 7 bits: 0xxxxxxx
            if (dp >= dl)
                return xflow(src, sp, sl, dst, dp, 1);
            // my change
            int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
            sp += n;
            dp += n;
        } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {

I set up a small improvised benchmark to get an idea of the impact:

    Benchmark                    (data)   Mode  Cnt        Score  Error  Units
    DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960,037         ops/s
    DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744,675         ops/s
    DecoderBenchmark.jdkDecoder  TD_999   thrpt    2   154232,940         ops/s
    DecoderBenchmark.jdkDecoder  TD_499   thrpt    2   142239,763         ops/s
    DecoderBenchmark.jdkDecoder  TD_99    thrpt    2   128678,229         ops/s
    DecoderBenchmark.jdkDecoder  TD_9     thrpt    2   127388,649         ops/s
    DecoderBenchmark.jdkDecoder  TD_4     thrpt    2   119834,183         ops/s
    DecoderBenchmark.jdkDecoder  TD_2     thrpt    2   111733,115         ops/s
    DecoderBenchmark.jdkDecoder  TD_1     thrpt    2   102397,455         ops/s
    DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997,518         ops/s
    DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450,005         ops/s
    DecoderBenchmark.newDecoder  TD_999   thrpt    2  2140307,712         ops/s
    DecoderBenchmark.newDecoder  TD_499   thrpt    2  1171970,809         ops/s
    DecoderBenchmark.newDecoder  TD_99    thrpt    2   686771,614         ops/s
    DecoderBenchmark.newDecoder  TD_9     thrpt    2    95181,541         ops/s
    DecoderBenchmark.newDecoder  TD_4     thrpt    2    65656,184         ops/s
    DecoderBenchmark.newDecoder  TD_2     thrpt    2    45439,240         ops/s
    DecoderBenchmark.newDecoder  TD_1     thrpt    2    36994,738         ops/s

(The benchmark uses only memory buffers. Each test input is a UTF-8 encoded byte buffer which produces 8000 chars and consists of runs of pure ASCII bytes of varying length, each followed by a 2-byte UTF-8 sequence producing a non-ASCII char:

    TD_8000: 8000 ascii bytes -> 1 call to JLA.decodeASCII
    TD_3999: 3999 ascii bytes + 2 non-ascii bytes, repeated 2 times -> 2 calls to JLA.decodeASCII
    ...
    TD_1:    1 ascii byte + 2 non-ascii bytes, repeated 4000 times -> 4000 calls to JLA.decodeASCII)
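Roughly, such an input can be built and decoded through the public API like this (a simplified sketch, not the exact benchmark code; the class name and the choice of 'é' as the 2-byte character are placeholders):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.StandardCharsets;

    public class TestDataSketch {

        // Builds an input that decodes to 8000 chars: runs of asciiLen ASCII bytes,
        // each followed by one 2-byte UTF-8 sequence, until 8000 chars are reached.
        static ByteBuffer testData(int asciiLen) {
            StringBuilder sb = new StringBuilder(8000);
            while (sb.length() < 8000) {
                sb.append("a".repeat(Math.min(asciiLen, 8000 - sb.length())));
                if (sb.length() < 8000) {
                    sb.append('é');   // 'é' encodes as 2 bytes in UTF-8
                }
            }
            return ByteBuffer.wrap(sb.toString().getBytes(StandardCharsets.UTF_8));
        }

        public static void main(String[] args) {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
            CharBuffer out = CharBuffer.allocate(8192);
            ByteBuffer in = testData(3999);   // the TD_3999 shape: (3999 + 1) * 2 = 8000 chars
            decoder.decode(in, out, true);
            decoder.flush(out);
            System.out.println("decoded chars: " + out.flip().remaining());
        }
    }

Varying asciiLen from 8000 down to 1 yields the TD_8000 ... TD_1 shapes listed above; decoding heap buffers like this should exercise the array-based decode loop discussed here.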
Interpretation:

* Input is all ASCII: same performance as before.
* Input contains pure ASCII sequences of considerable length, interrupted by non-ASCII bytes: now seeing huge performance improvements, similar to the pure ASCII case.
* Input has lots of short sequences of ASCII bytes interrupted by non-ASCII bytes: at some point performance drops below the current implementation.

Thanks for reading and happy to hear your opinions,
Johannes
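For readers without the JDK sources at hand, the batching structure being proposed can be illustrated with a small self-contained sketch. decodeAsciiRun is only a scalar stand-in for the JLA.decodeASCII intrinsic, and only 1- and 2-byte sequences are handled; it is meant to show where the repeated ASCII-batch calls would sit, not to mirror the real sun.nio.cs.UTF_8 code:

    import java.nio.charset.StandardCharsets;

    public class AsciiBatchingSketch {

        // Scalar stand-in for the JLA.decodeASCII intrinsic: copies chars until a
        // non-ASCII byte (or the limit) is reached and returns how many were copied.
        static int decodeAsciiRun(byte[] src, int sp, char[] dst, int dp, int limit) {
            int n = 0;
            while (n < limit && src[sp + n] >= 0) {
                dst[dp + n] = (char) src[sp + n];
                n++;
            }
            return n;
        }

        // Simplified decode loop: every time an ASCII byte is seen, the remaining
        // input is handed to the ASCII fast path instead of copying a single byte.
        // dst is assumed large enough; unsupported input is replaced with U+FFFD.
        static int decode(byte[] src, char[] dst) {
            int sp = 0, dp = 0;
            int sl = src.length;
            while (sp < sl) {
                int b1 = src[sp];
                if (b1 >= 0) {                                    // 1 byte: 0xxxxxxx
                    int n = decodeAsciiRun(src, sp, dst, dp, Math.min(sl - sp, dst.length - dp));
                    sp += n;
                    dp += n;
                } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0    // 2 bytes: 110xxxxx 10xxxxxx
                        && sp + 1 < sl) {                         // (continuation-byte check omitted)
                    int b2 = src[sp + 1];
                    dst[dp++] = (char) (((b1 << 6) ^ b2) ^ 0x0f80);
                    sp += 2;
                } else {
                    dst[dp++] = '\uFFFD';                         // 3-/4-byte and malformed cases not handled here
                    sp++;
                }
            }
            return dp;
        }

        public static void main(String[] args) {
            byte[] in = "ascii text, then åäö, then ascii again".getBytes(StandardCharsets.UTF_8);
            char[] out = new char[in.length];
            int len = decode(in, out);
            System.out.println(new String(out, 0, len));
        }
    }

The real decoder additionally handles 3- and 4-byte sequences, malformed input, overflow/underflow and direct buffers; the sketch only shows where the extra calls to the ASCII fast path would occur, and why their per-call cost starts to matter once the ASCII runs get short.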