On Fri, 26 Aug 2022 09:25:55 GMT, Alan Bateman <[email protected]> wrote:
>> OpenJDK supports "Japanese EBCDIC - Katakana" and "Korean EBCDIC" SBCS and
>> DBCS Only charsets.
>> |Charset|Mix|SBCS|DBCS|
>> | -- | -- | -- | -- |
>> | Japanese EBCDIC - Katakana | Cp930 | Cp290 | Cp300 |
>> | Korean | Cp933 | Cp833 | Cp834 |
>>
>> But OpenJDK does not support some of the "Japanese EBCDIC - English" /
>> "Simplified Chinese EBCDIC" / "Traditional Chinese EBCDIC" SBCS and DBCS
>> Only charsets.
>>
>> I'd like to request Cp1027/Cp835/Cp836/Cp837 for consistency
>> |Charset|Mix|SBCS|DBCS|
>> | ------------- | ------------- | ------------- | ------------- |
>> | Japanese EBCDIC - English | Cp939 | **Cp1027** | Cp300 |
>> | Simplified Chinese EBCDIC | Cp935 | **Cp836** | **Cp837** |
>> | Traditional Chinese EBCDIC | Cp937 | (*1) | **Cp835** |
>>
>> *1: Cp037 compatible
>
>> Use the following options:
>> OpenJDK: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047 20000 1 1`
>> ICU4J: `java -cp icu4j-71_1.jar:icu4j-charset-71_1.jar:. tc IBM-1047_P100-1995 20000 1 1`
>>
>> Actually, I'm confused by this result. Previously, I was just comparing A/A
>> with B/B on OpenJDK's charset. I didn't think ICU4J's result would make a
>> difference.
>
> My initial reaction is one of relief that the icu4j provider can be used with
> current JDK builds. This means there is an option should we decide to stop
> adding more EBCDIC charsets to the JDK.
>
> The test uses IBM-1047 and I can't tell if the icu4j provider is used or not.
> Charset doesn't define a provider method but I think it would be useful to print
> cs.getClass() or cs.getClass().getModule() so we know which Charset
> implementation is used. Also I think any discussion on performance would be
> better served with a JMH benchmark rather than a standalone test.

Hello @AlanBateman,

Sorry for the late reply.

I created a Charset SPI JAR (`custom-charsets.jar`) that provides
`x-IBM1047_SPI`, a charset ported from `sun.nio.cs.SingleByte.java` and the
generated `IBM1047.java`.
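
For readers unfamiliar with the Charset SPI, the general shape of such a provider can be sketched as below. This is only an illustration under stated assumptions, not the code in `custom-charsets.jar`: the names `DemoCharsetProvider`, `DemoCharset` and `x-Demo-EBCDIC` are hypothetical, and the mapping table is a toy that covers only space, `?`, digits and uppercase letters at their usual EBCDIC positions. A real provider would be a public class listed in `META-INF/services/java.nio.charset.spi.CharsetProvider`.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.spi.CharsetProvider;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical provider (declare it public in its own file when packaging a real JAR).
class DemoCharsetProvider extends CharsetProvider {
    static final Charset DEMO = new DemoCharset();

    @Override
    public Charset charsetForName(String name) {
        return DEMO.name().equalsIgnoreCase(name) ? DEMO : null;
    }

    @Override
    public Iterator<Charset> charsets() {
        return Collections.singletonList(DEMO).iterator();
    }

    // Toy table-driven single-byte charset; NOT a complete IBM1047 mapping.
    static final class DemoCharset extends Charset {
        static final char[] B2C = new char[256]; // byte -> char, U+FFFD = unmapped
        static final int[] C2B = new int[128];   // char -> byte, -1 = unmapped
        static {
            Arrays.fill(B2C, '\uFFFD');
            Arrays.fill(C2B, -1);
            map(0x40, ' ');
            map(0x6F, '?');
            for (int i = 0; i < 10; i++) map(0xF0 + i, (char) ('0' + i));
            for (int i = 0; i < 9; i++) {
                map(0xC1 + i, (char) ('A' + i)); // A..I
                map(0xD1 + i, (char) ('J' + i)); // J..R
            }
            for (int i = 0; i < 8; i++) map(0xE2 + i, (char) ('S' + i)); // S..Z
        }

        static void map(int b, char c) {
            B2C[b] = c;
            C2B[c] = b;
        }

        DemoCharset() {
            super("x-Demo-EBCDIC", null);
        }

        @Override
        public boolean contains(Charset cs) {
            return cs == this;
        }

        @Override
        public CharsetDecoder newDecoder() {
            return new CharsetDecoder(this, 1f, 1f) {
                @Override
                protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                    while (in.hasRemaining()) {
                        if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                        char c = B2C[in.get(in.position()) & 0xFF];
                        if (c == '\uFFFD') return CoderResult.unmappableForLength(1);
                        in.get();   // consume the byte only once it is known to be mappable
                        out.put(c);
                    }
                    return CoderResult.UNDERFLOW;
                }
            };
        }

        @Override
        public CharsetEncoder newEncoder() {
            // 0x6F is '?' in this table, so it is a legal replacement for unmappable input.
            return new CharsetEncoder(this, 1f, 1f, new byte[] {0x6F}) {
                @Override
                protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                    while (in.hasRemaining()) {
                        if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                        char c = in.get(in.position());
                        int b = c < 128 ? C2B[c] : -1;
                        if (b < 0) return CoderResult.unmappableForLength(1);
                        in.get();
                        out.put((byte) b);
                    }
                    return CoderResult.UNDERFLOW;
                }
            };
        }
    }
}
```

A charset built this way only goes through the public `CharsetEncoder`/`CharsetDecoder` API, for example `"HELLO 42".getBytes(DemoCharsetProvider.DEMO)`.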

Test code:

    package com.example;

    import java.nio.charset.Charset;
    import org.openjdk.jmh.annotations.Benchmark;

    public class MyBenchmark {
        static final String s;

        static {
            // 0x2000 chars cycling through U+0000..U+00FF, so the string is Latin-1 only
            char[] ca = new char[0x2000];
            for (int i = 0; i < ca.length; i++) {
                ca[i] = (char) (i & 0xFF);
            }
            s = new String(ca);
        }

        // Built-in JDK charset
        @Benchmark
        public void testIBM1047() throws Exception {
            byte[] ba = s.getBytes("IBM1047");
        }

        // Same charset loaded through the Charset SPI
        @Benchmark
        public void testIBM1047_SPI() throws Exception {
            byte[] ba = s.getBytes("x-IBM1047_SPI");
        }
    }

All test-related files are in
[JDK-8289834](https://bugs.openjdk.org/browse/JDK-8289834).
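
To confirm which implementation each charset name resolves to, as suggested above with `cs.getClass()` and `cs.getClass().getModule()`, a small check along these lines could be run next to the benchmark. It is a minimal sketch: the class name `CharsetOrigin` and the helper `describe` are hypothetical, and `getModule()` needs JDK 9 or later (on JDK 8 only the class name can be printed).

```java
import java.nio.charset.Charset;

// Hypothetical helper; not part of the files attached to the JDK issue.
class CharsetOrigin {
    // Reports the implementing class and module of the charset registered under `name`.
    static String describe(String name) {
        if (!Charset.isSupported(name)) {
            return name + ": not available on this runtime";
        }
        Charset cs = Charset.forName(name);
        return cs.name() + ": class=" + cs.getClass().getName()
                + ", module=" + cs.getClass().getModule();
    }

    public static void main(String[] args) {
        // The SPI name is only found when the provider JAR is on the class path.
        System.out.println(describe("IBM1047"));
        System.out.println(describe("x-IBM1047_SPI"));
    }
}
```

A charset loaded from a class-path JAR would be expected to report an unnamed module, while a built-in one reports a named JDK module.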

Test results on RHEL 8.6 x86_64 (Intel Core i7 3520M):

1.8.0_345-b01

    Benchmark                     Mode  Cnt       Score      Error  Units
    MyBenchmark.testIBM1047      thrpt   25   53213.092 ±  126.962  ops/s
    MyBenchmark.testIBM1047_SPI  thrpt   25   47442.669 ±  349.003  ops/s

20-ea+17-1181

    Benchmark                     Mode  Cnt       Score      Error  Units
    MyBenchmark.testIBM1047      thrpt   25  136331.141 ± 1078.481  ops/s
    MyBenchmark.testIBM1047_SPI  thrpt   25   51563.213 ±  843.238  ops/s

On JDK 20, the built-in IBM1047 is about 2.6 times faster than the SPI
version; on JDK 8 the gap is only about 1.1 times.
I think these results are related to **JEP 254: Compact Strings**.
As I requested before, we'd like the `sun.nio.cs.SingleByte*` and
`sun.nio.cs.DoubleByte*` classes to be available as public API.

-------------
PR: https://git.openjdk.org/jdk/pull/9399