Hi Johannes,
I think the 3rd scenario you've mentioned is likely: we have Swedish and other 
languages that extend ASCII with diacritics, where non-ASCII bytes frequently 
interrupt the ASCII runs. For non-ASCII-heavy languages like Chinese, the text 
can still include spaces or ASCII digits; invoking the intrinsic for every such 
byte sounds a bit unwise too.
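
For illustration (a quick sketch of the data pattern, not code from the patch): 
in such text the ASCII runs handed to the intrinsic are often only a couple of 
bytes long:

import java.nio.charset.StandardCharsets;

// diacritics become 2-byte UTF-8 sequences that cut the ASCII runs short
byte[] sv = "Skärgårdsö".getBytes(StandardCharsets.UTF_8);
// 'S','k',0xC3,0xA4,'r','g',0xC3,0xA5,'r','d','s',0xC3,0xB6
// -> ASCII runs of only 2, 2 and 3 bytes between the non-ASCII sequences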

Regards,
Chen Liang
________________________________
From: core-libs-dev <core-libs-dev-r...@openjdk.org> on behalf of Johannes 
Döbler <j...@civilian-framework.org>
Sent: Monday, May 12, 2025 6:16 AM
To: core-libs-dev@openjdk.org <core-libs-dev@openjdk.org>
Subject: potential performance improvement in sun.nio.cs.UTF_8

I have a suggestion for a performance improvement in sun.nio.cs.UTF_8, the 
workhorse for stream-based UTF-8 encoding and decoding, but I don't know if 
this has been discussed before.
I'll explain my idea for the decoding case:
Claes Redestad describes in his blog post 
https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html how he used 
SIMD intrinsics (now exposed via JavaLangAccess.decodeASCII) to speed up UTF-8 
decoding when buffers are backed by arrays:

https://github.com/openjdk/jdk/blob/0258d9998ebc523a6463818be00353c6ac8b7c9c/src/java.base/share/classes/sun/nio/cs/UTF_8.java#L231

  *   first, a call to JLA.decodeASCII harvests all ASCII characters (= 1-byte 
UTF-8 sequences) at the beginning of the input
  *   then the slow loop takes over, looking at UTF-8 byte sequences in the 
input buffer and writing to the output buffer (this is basically the old 
implementation)

If the input is all ASCII, all decoding work is done in JLA.decodeASCII, 
resulting in an extreme performance boost. But as soon as the input contains a 
non-ASCII byte, the code falls back to the slow array loop for the rest of the 
input.
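
For reference, the relevant part of the current decodeArrayLoop looks roughly 
like this (paraphrased from the file linked above; bounds handling and the 
longer sequences omitted):

int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
sp += n;   // skip over the leading ASCII run in the source
dp += n;   // and over the chars already written to the destination

while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx -- copied one char at a time
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        da[dp++] = (char) b1;
        sp++;
    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {
        // 2 bytes, 11 bits: 110xxxxx 10xxxxxx
        ...
    }
    // 3- and 4-byte sequences handled similarly
}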

Now here is my idea: why not call JLA.decodeASCII whenever an ASCII byte is 
seen:

while (sp < sl) {
    int b1 = sa[sp];
    if (b1 >= 0) {
        // 1 byte, 7 bits: 0xxxxxxx
        if (dp >= dl)
            return xflow(src, sp, sl, dst, dp, 1);
        // my change: instead of writing the single char, hand the whole
        // remaining run to the intrinsic, which stops at the next
        // non-ASCII byte and returns the number of chars decoded
        int n = JLA.decodeASCII(sa, sp, da, dp, Math.min(sl - sp, dl - dp));
        sp += n;
        dp += n;
    } else if ((b1 >> 5) == -2 && (b1 & 0x1e) != 0) {

I set up a small improvised benchmark to get an idea of the impact:

Benchmark                     (data)   Mode  Cnt        Score   Error  Units
DecoderBenchmark.jdkDecoder  TD_8000  thrpt    2  2045960,037          ops/s
DecoderBenchmark.jdkDecoder  TD_3999  thrpt    2   263744,675          ops/s
DecoderBenchmark.jdkDecoder   TD_999  thrpt    2   154232,940          ops/s
DecoderBenchmark.jdkDecoder   TD_499  thrpt    2   142239,763          ops/s
DecoderBenchmark.jdkDecoder    TD_99  thrpt    2   128678,229          ops/s
DecoderBenchmark.jdkDecoder     TD_9  thrpt    2   127388,649          ops/s
DecoderBenchmark.jdkDecoder     TD_4  thrpt    2   119834,183          ops/s
DecoderBenchmark.jdkDecoder     TD_2  thrpt    2   111733,115          ops/s
DecoderBenchmark.jdkDecoder     TD_1  thrpt    2   102397,455          ops/s
DecoderBenchmark.newDecoder  TD_8000  thrpt    2  2022997,518          ops/s
DecoderBenchmark.newDecoder  TD_3999  thrpt    2  2909450,005          ops/s
DecoderBenchmark.newDecoder   TD_999  thrpt    2  2140307,712          ops/s
DecoderBenchmark.newDecoder   TD_499  thrpt    2  1171970,809          ops/s
DecoderBenchmark.newDecoder    TD_99  thrpt    2   686771,614          ops/s
DecoderBenchmark.newDecoder     TD_9  thrpt    2    95181,541          ops/s
DecoderBenchmark.newDecoder     TD_4  thrpt    2    65656,184          ops/s
DecoderBenchmark.newDecoder     TD_2  thrpt    2    45439,240          ops/s
DecoderBenchmark.newDecoder     TD_1  thrpt    2    36994,738          ops/s

(The benchmark uses only in-memory buffers; each test input is a UTF-8 encoded 
byte buffer which decodes to 8000 chars and consists of runs of pure ASCII 
bytes of varying length, each followed by a 2-byte UTF-8 sequence producing a 
non-ASCII char:
TD_8000: 8000 ascii bytes -> 1 call to JLA.decodeASCII
TD_3999: 3999 ascii bytes + 2 non-ascii bytes, repeated 2 times -> 2 calls to 
JLA.decodeASCII
...
TD_1: 1 ascii byte + 2 non-ascii bytes, repeated 4000 times -> 4000 calls to 
JLA.decodeASCII)
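
For concreteness, such a TD_N input could be built roughly like this (my 
reconstruction of the shape described above, not the actual benchmark code; 
'é' = 0xC3 0xA9 serves as the 2-byte sequence):

import java.io.ByteArrayOutputStream;

// Builds an input of N-byte ASCII runs, each followed by one 2-byte
// UTF-8 sequence, until the decoded output would reach 8000 chars.
static byte[] makeTestData(int asciiRun) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int chars = 0;
    while (chars < 8000) {
        for (int i = 0; i < asciiRun && chars < 8000; i++) {
            out.write('a');
            chars++;
        }
        if (chars < 8000) {
            out.write(0xC3);   // 'é' encoded in UTF-8
            out.write(0xA9);
            chars++;
        }
    }
    return out.toByteArray();
}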

Interpretation:

  *   Input is all ASCII: same performance as before
  *   Input contains ASCII runs of considerable length interrupted by 
non-ASCII bytes: huge performance improvements, similar to the pure-ASCII case.
  *   Input consists of many short ASCII runs interrupted by non-ASCII bytes: 
at some point performance drops below the current implementation.

Thanks for reading and happy to hear your opinions,
Johannes
