Re: RFR: 8289834: Add SBCS and DBCS Only EBCDIC charsets

Ichiroh Takiguchi Mon, 03 Oct 2022 00:16:52 -0700

On Fri, 26 Aug 2022 09:25:55 GMT, Alan Bateman <[email protected]> wrote:


>> OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and 
>> DBCS Only charsets.
>> |Charset|Mix|SBCS|DBCS|
>> | -- | -- | -- | -- |
>> | Japanese EBCDIC - Katakana | Cp930 | Cp290 | Cp300 |
>> | Korean | Cp933 | Cp833 | Cp834 |
>> 
>> But OpenJDK does not support some of the "Japanese EBCDIC - English" / 
>> "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS 
>> Only charsets.
>> 
>> I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency.
>> |Charset|Mix|SBCS|DBCS|
>> | ------------- | ------------- | ------------- | ------------- |
>> | Japanese EBCDIC - English | Cp939 | **Cp1027** | Cp300 |
>> | Simplified Chinese EBCDIC | Cp935 | **Cp836** | **Cp837** |
>> | Traditional Chinese EBCDIC | Cp937 | (*1) | **Cp835** | 
>> 
>> *1: Cp037 compatible
>
>> Use the following options:
>> OpenJDK: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
>> ICU4J: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`
>> 
>> Actually, I'm confused by this result. Previously, I was just comparing A/A 
>> with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a 
>> difference.
> 
> My initial reaction is one of relief that the icu4j provider can be used with 
> current JDK builds. This means there is an option should we decide to stop 
> adding more EBCDIC charsets to the JDK.
> 
> The test uses IBM-1047 and I can't tell if the icu4j provider is used or not. 
> Charset doesn't define a provider method but I think would be useful to print 
> cs.getClass() or cs.getClass().getModule() so we know which Charset 
> implementation is used. Also I think any discussion on performance would be 
> better served with a JMH benchmark rather than a standalone test.

Hello @AlanBateman,
Sorry for the late reply.

I created a Charset SPI JAR (`custom-charsets.jar`) that provides the charset 
`x-IBM1047_SPI`. It was ported from `sun.nio.cs.SingleByte.java` and the 
generated `IBM1047.java`.
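
To confirm which implementation serves each charset name, as suggested in the quoted comment, something like the following could be run with `custom-charsets.jar` on the class path. This is only a sketch: the class name `WhichCharset` and the helper `describe` are made up for illustration, and `Class.getModule()` needs Java 9 or later, so it will not compile on 1.8.

```java
import java.nio.charset.Charset;

public class WhichCharset {

    // Report the canonical name, implementation class, and module that
    // back a charset name. Uses Class.getModule(), so Java 9+ only.
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + " -> not supported";
        }
        Charset cs = Charset.forName(name);
        Class<?> impl = cs.getClass();
        return name + " -> " + cs.name()
                + " (" + impl.getName() + ", " + impl.getModule() + ")";
    }

    public static void main(String[] args) {
        // Names taken from the benchmark; the second one resolves only
        // when the SPI JAR is on the class path.
        for (String name : new String[] {"IBM1047", "x-IBM1047_SPI"}) {
            System.out.println(describe(name));
        }
    }
}
```

A JDK-provided charset should report a `sun.nio.cs` class in a named module, while a charset loaded from a class-path JAR should report the unnamed module.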

Test code:

package com.example;

import java.nio.charset.Charset;
import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {

    final static String s;

    static {
        char[] ca = new char[0x2000];
        for (int i = 0; i < ca.length; i++) {
            ca[i] = (char) (i & 0xFF);
        }
        s = new String(ca);
    }

    @Benchmark
    public void testIBM1047() throws Exception {
        byte[] ba = s.getBytes("IBM1047");
    }

    @Benchmark
    public void testIBM1047_SPI() throws Exception {
        byte[] ba = s.getBytes("x-IBM1047_SPI");
    }

}

All test related files are in 
[JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834).

Test results on RHEL 8.6 x86_64 (Intel Core i7 3520M) are as follows:

1.8.0_345-b01
Benchmark                     Mode  Cnt      Score     Error  Units
MyBenchmark.testIBM1047      thrpt   25  53213.092 ± 126.962  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25  47442.669 ± 349.003  ops/s


20-ea+17-1181
Benchmark                     Mode  Cnt       Score      Error  Units
MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

On JDK 20, the built-in IBM1047 is about 2.6 times faster than the SPI version, 
while on JDK 8 the difference is only about 12%.
I think these results are related to **JEP 254: Compact Strings**, which was 
delivered in JDK 9.
As I requested before, we'd like to be able to use the `sun.nio.cs.SingleByte*` 
and `sun.nio.cs.DoubleByte*` classes as public API.
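
As a quick check of the ratios, here is a small sketch that divides the built-in score by the SPI score for each JDK. The scores are copied from the tables above; the class name `SpeedupCheck` and its helper methods are only for illustration.

```java
import java.util.Locale;

public class SpeedupCheck {

    // Throughput ratio of the built-in charset to the SPI charset.
    static double ratio(double builtIn, double spi) {
        return builtIn / spi;
    }

    // Format with two decimals, independent of the default locale.
    static String fmt(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) {
        // Scores in ops/s, copied from the JMH output above.
        System.out.println("JDK 8  (1.8.0_345-b01): " + fmt(ratio(53213.092, 47442.669)));
        System.out.println("JDK 20 (20-ea+17-1181): " + fmt(ratio(136331.141, 51563.213)));
    }
}
```

This prints 1.12 for JDK 8 and 2.64 for JDK 20, ignoring the reported error margins.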

-------------

PR: https://git.openjdk.org/jdk/pull/9399
