Re: Rewrite of IBM doublebyte charsets

Ulf Zibis Thu, 21 May 2009 16:41:45 -0700

On 21.05.2009 00:22, Xueming Shen wrote:


Ulf Zibis wrote:


(6) Unload b2cStr from memory after startup:
   - outsource b2cStr to additional class file like EUC_TW approach
   - set b2cStr = null after startup (remove final modifier)
  Benefit[6]: avoid 100 % superfluous memory-footprint

I doubt it really saves anything real, since the "class" should still keep its copy somewhere... and I will need it for c2b (now I'm "delaying" the c2b init).

I always thought that setting an object reference to null after use would let it be GCed automatically. Am I wrong? ... But we can do the c2b init from b2c[][] instead of from b2cStr[], so why keep it?
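To make the "c2b init from b2c[][]" idea concrete, here is a minimal sketch of inverting a two-level decode table into an encode table. The class name, the flat 64K c2b layout, the b2Min parameter and the null-segment convention are assumptions made for illustration only; they are not the layout used by the actual DoubleByte classes.

```java
// Hypothetical sketch: derive c2b from b2c[][] so that b2cStr is not needed.
final class C2BInitSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    static final char UNMAPPABLE_ENCODING = '\uFFFD';

    // Assumption: b2c[b1] is null when lead byte b1 has no mappings;
    // otherwise b2c[b1][i] is the char for trail byte (b2Min + i).
    static char[] initC2B(char[][] b2c, int b2Min) {
        char[] c2b = new char[0x10000];
        java.util.Arrays.fill(c2b, UNMAPPABLE_ENCODING);
        for (int b1 = 0; b1 < b2c.length; b1++) {
            char[] segment = b2c[b1];
            if (segment == null)
                continue;
            for (int i = 0; i < segment.length; i++) {
                char c = segment[i];
                if (c == UNMAPPABLE_DECODING)
                    continue;               // nothing to invert for this slot
                // Pack lead and trail byte into one 16-bit value.
                c2b[c] = (char) ((b1 << 8) | (b2Min + i));
            }
        }
        return c2b;
    }
}
```

Non-roundtrip and c2b-only entries (the *.nr and *.c2b files) would still have to be applied on top of this inversion; the sketch leaves them out.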

(7) Avoid copying b2cStr to b2c:
   (String#charAt() is fast as char[] access)
  Benefit[7]: increase startup performance for decoder
I tried again last night. char[][] is much faster than the String[] version in both client and server VM. So keep it as is. (This was actually why I switched from String[] to char[][].)

I'm surprised, because I had in mind from older benchmarks that char_array[index] had the same speed as str.charAt(index) after optimization by HotSpot. I also had these results here:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/branches/array_io_string/src/sun/nio/cs/SingleByteFastDecoder.java?rev=&view=markup
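For reference, the two lookup styles being compared look roughly like this. The sketch only shows that both forms are functionally equivalent for a single-byte table; it says nothing about their relative speed, which would need a proper benchmark. The class name and table contents are made up for illustration.

```java
// Hypothetical sketch: the same 256-entry table as a String and as a char[].
final class TableLookupSketch {
    static final String B2C_STR;
    static final char[] B2C;

    static {
        StringBuilder sb = new StringBuilder(256);
        for (int i = 0; i < 256; i++)
            sb.append(i < 0x80 ? (char) i : '\uFFFD'); // ASCII only, rest unmappable
        B2C_STR = sb.toString();
        B2C = B2C_STR.toCharArray();                   // the copy item (7) wants to avoid
    }

    // Lookup directly in the String, no copy needed.
    static char viaString(byte b) {
        return B2C_STR.charAt(b & 0xff);
    }

    // Lookup in the copied char[].
    static char viaArray(byte b) {
        return B2C[b & 0xff];
    }
}
```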

(12) Get rid of sun.io package dependency:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
  Benefit[13]: avoid superfluous disk-footprint
  Benefit[14]: save maintenance of sun.io converters
Disadvantage[1]: published under JRL (waiting for launch of OpenJDK-7 project "charset-enhancement") ;-)
This is not something about engineering. It's about license, policy...


So hopefully we would have OpenJDK7 project "charset-enhancement" soon.


(17) Decoder#decodeArrayLoop: shortcut for single byte only:
     int sr = src.remaining();
     int sl = sp + sr;
     int dr = dst.remaining();
     int dl = dp + dr;
     // single byte only loop
     int slSB = sp + (sr < dr ? sr : dr);
     while (sp < slSB) {
         char c = b2cSB[sa[sp] & 0xff];
         if (c == UNMAPPABLE_DECODING)
             break;
         da[dp++] = c;
         sp++;
     }
    Same for Encoder#encodeArrayLoop
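A self-contained version of this shortcut might look like the sketch below. It uses a bitwise `& 0xff` mask to turn the signed byte into a table index and parenthesizes the limit computation so the loop never runs past the shorter of the two buffers. The class name, the method signature and the return value are assumptions for illustration; the real decodeArrayLoop works on ByteBuffer/CharBuffer and returns a CoderResult.

```java
// Hypothetical sketch of the single-byte-only fast loop from item (17).
final class SingleByteShortcutSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';

    // Decodes the leading run of single-byte input and returns the number
    // of bytes consumed; the caller falls back to the general loop after that.
    static int decodeSingleByteRun(byte[] sa, int sp, int sr,
                                   char[] da, int dp, int dr,
                                   char[] b2cSB) {
        int start = sp;
        // Stop at whichever is exhausted first: source or destination.
        int slSB = sp + (sr < dr ? sr : dr);
        while (sp < slSB) {
            char c = b2cSB[sa[sp] & 0xff];
            if (c == UNMAPPABLE_DECODING)
                break;                      // not a single-byte char, leave fast path
            da[dp++] = c;
            sp++;
        }
        return sp - start;
    }
}
```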

(18) Decoder_EBCDIC: boolean singlebyteState:
     if (singlebyteState)
         ...

(19) Decoder_EBCDIC: decode single byte first:
     if (singlebyteState)
         c = b2cSB[b1];
     if (c == UNMAPPABLE_DECODING) {
         ...
     }
  Benefit[20]: should be faster

It's not like when we are dealing with singlebyte charsets. For doublebyte charsets, priority should be given to doublebyte codepoints, if possible, not singlebyte codepoints.

- I am assuming that singlebyte-only input is a common use case. Am I wrong in the case of EBCDIC?
- This hack doesn't make processing of "normal" mixed input slower after escaping to the "normal" while(...) loop.
- This hack was copied from the UTF-8 coder, where ASCII-only input is a common use case.

*** Encoder-Suggestions:

(21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
  Benefit[21]: reduce no. of files
  Benefit[22]: simplifies initC2B() (avoids 2 loops)
In theory you can do some magic to "join" .nr into .c2b. The price might be more complicated logic that depends on the codepoints. You may end up doing some table lookup for each codepoint in b2c when processing.

This "magic" should be done in GenerateDBCS.java, so the price must only be paid once while building the JDK. But to be honest, it could be done by hand for those few mapping pairs. See my single-byte IBMxxx mappings here:

https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/trunk/make/tools/CharsetMapping/ext/
... and don't forget, it avoids copying the whole b2c.


And big thanks for all the suggestions.


Thanks for your appreciation. :-)

-Ulf
