Thanks for the 5 minutes:-)
Your FindXYZcoderBugs tests are indeed very helpful for catching most of the
"inconsistent" behaviors
between different paths by feeding in "random" inputs.
TestIBMDB.java diffs the behavior of the old implementation and the
new implementation
over all "decode-able" bytes and "encode-able" chars... so it gives us
some guarantee.
Oh, the typos are being fixed:-)
Sherman
Martin Buchholz wrote:
5-minute review.
Looks pretty good.
Thanks for fixing old bugs found by FindXXcoderBugs.java
Thanks Ulf for your hard work on implementation and review.
Again, charset changes are very hard to review, hence scary,
but relatively easy to test in an automated fashion.
I imagine generating tests with
random input,
random charset,
random choice of direct/non-direct buffers,
and checking that charset behavior is unchanged
except for performance.
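Something along these lines would already catch heap/direct divergence
(a rough sketch only, not the actual FindXXcoderBugs harness; the charset
name, iteration count, and input sizes are arbitrary):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;
import java.util.Random;

public class RandomCoderConsistency {
    public static void main(String[] args) throws CharacterCodingException {
        Random rnd = new Random(42);                  // fixed seed so failures reproduce
        Charset cs = Charset.forName("x-IBM933");     // assumes the extended charsets are available
        for (int round = 0; round < 100000; round++) {
            byte[] in = new byte[rnd.nextInt(64) + 1];
            rnd.nextBytes(in);

            CharBuffer fromHeap = decode(cs, ByteBuffer.wrap(in));
            ByteBuffer direct = ByteBuffer.allocateDirect(in.length);
            direct.put(in).flip();
            CharBuffer fromDirect = decode(cs, direct);

            if (!fromHeap.equals(fromDirect))
                throw new AssertionError("heap/direct mismatch for " + Arrays.toString(in));
        }
    }

    static CharBuffer decode(Charset cs, ByteBuffer bb) throws CharacterCodingException {
        // REPLACE instead of REPORT so malformed random input does not abort the run
        return cs.newDecoder()
                 .onMalformedInput(CodingErrorAction.REPLACE)
                 .onUnmappableCharacter(CodingErrorAction.REPLACE)
                 .decode(bb);
    }
}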
---
// it's time to generate the class file
Confusing - you're generating a source file,
and the phrase "class file" usually means
XX.class
typos:
typitcal
date structure.
correspoinding
ruondtrip
file dbcs has trailing newlines
Martin
On Wed, May 20, 2009 at 15:22, Xueming Shen <[email protected]> wrote:
Ulf, thanks for the review!
Now we have a bugid for this one. The webrev has been updated to
address some of your
suggestions (see my comments below) and renamed to
http://cr.openjdk.java.net/~sherman/6843578/webrev
(old webrevs are still there with webrev.0x)
With this change,
(1) a 21% reduction of rt.jar size (from 2.99M, with the euc_tw reduction, to
2.35M)
(2) much faster coding performance; a quick/not_really_a_benchmark
result can be found at
http://cr.openjdk.java.net/~sherman/6843578/codingPerm
(3) also addressed the bugids listed below.
6639450: IBM949C encoder modifies state of IBM949 encoder
6569191: Cp943 io converter returns U+0000 and U+FFFD for
unconvertable character
6577466: Character encoder IBM970 throws a BufferOverflowException
5065777: CharsetEncoder canEncode() methods often incorrectly
return false
The test case test/sun/nio/cs/OLD/TestIBMDB.java is used to make
sure there is no mapping
regression. There are new mapping entries added (because we switched
to the latest IBM cdc
mapping tables for some charsets, for which the b2c tables we have are
not accurate/complete), but
no conflict with previous mappings.
Alan, Martin, I need you guys to also help review the change; I will
need a name on the putback message.
The number of files is huge, but only the files listed at the beginning
need attention:-)
Sherman
Ulf Zibis wrote:
*** Decoder-Suggestions:
(1) Unused imports in DoubleByte-X.java:
import java.util.Arrays;
import sun.nio.cs.StandardCharsets;
import static sun.nio.cs.CharsetMapping.*;
import sun.nio.cs.ext.DoubleByte; // or instead: import static sun.nio.cs.ext.DoubleByte.*;
done! good catch.
(3) Modify dimension of b2c:
char[][] b2c = new char[0x100][segSize];
so decode becomes:
public char decodeDouble(int b1, int b2) {
if ((b2-=b2Min) < 0 || b2 >= segSize)
return UNMAPPABLE_DECODING;
return b2c[b1][b2];
}
Benefit[1]: increase performance of decoder
Benefit[2]: reduce memory of B2C_UNMAPPABLE from 8192 to 512
bytes
Benefit[3]: some of b2c pages could be saved (if only
containing \uFFFD)
A good suggestion. I tried it, and it does bring about a 10%+ gain
in decoding (for non-direct buffers;
we also see a similar gain for direct buffers, but the % number is
not as big). The webrev has been updated
for this one.
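For reference, a minimal sketch of how the shared-segment layout can look
(field names follow the snippet above; the initialization logic here is only
illustrative, not the generated code):

class B2CSketch {
    static final char UNMAPPABLE_DECODING = '\uFFFD';
    static final int b2Min = 0x40;
    static final int segSize = 0xfe - b2Min + 1;

    // one shared "all unmappable" segment instead of one filler page per lead byte
    static final char[] B2C_UNMAPPABLE = new char[segSize];
    static final char[][] b2c = new char[0x100][];
    static {
        java.util.Arrays.fill(B2C_UNMAPPABLE, UNMAPPABLE_DECODING);
        for (int b1 = 0; b1 < 0x100; b1++)
            b2c[b1] = B2C_UNMAPPABLE;          // default: whole page unmappable
        // a real charset would now overwrite only the pages that have mappings,
        // e.g. b2c[0x41] = "...".toCharArray();
    }

    static char decodeDouble(int b1, int b2) {
        if ((b2 -= b2Min) < 0 || b2 >= segSize)
            return UNMAPPABLE_DECODING;
        return b2c[b1][b2];
    }
}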
(4) Don't care about b2Max (it's always not far from 0xff):
Benefit[4]: another performance increase of decoder (only
check: (b2-=b2Min) < 0)
Decided to keep it. We do have some charsets (not included this
time) that care about b2max.
(5) Truncate String segments (there are 65 % "\uFFFD" in IBM933):
(fill b2c segments first with "\uFFFD", then initialize)
Benefit[5]: save up to 180 % superfluous memory and
disk-footprint
Changing the b2c index from 4-bit to 8-bit (#2) should have
already improved this a lot (I see
a 1.5%+ improvement in the overall charsets.jar size)
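For illustration only, the truncation idea could look like this (a hypothetical
helper, not the generated initializer): fill each page with U+FFFD first, then
copy whatever the shortened mapping string provides, so trailing "\uFFFD" runs
never have to appear in the .java source.

class B2CInitSketch {
    // b2cStr may be shorter than segSize once trailing '\uFFFD' runs are stripped
    static void initPage(char[][] b2c, int b1, String b2cStr, int segSize) {
        char[] page = new char[segSize];
        java.util.Arrays.fill(page, '\uFFFD');
        b2cStr.getChars(0, b2cStr.length(), page, 0);
        b2c[b1] = page;
    }
}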
(6) Unload b2cStr from memory after startup:
- outsource b2cStr to additional class file like EUC_TW approach
- set b2cStr = null after startup (remove final modifier)
Benefit[6]: avoid 100 % superfluous memory-footprint
I doubt it really saves anything real, since the "class" should
still keep its copy somewhere... and
I will need it for c2b (now I'm "delaying" the c2b init)
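Roughly the shape of the "delayed" c2b init (a hedged sketch only; the real
DoubleByte encoder differs in detail): the char-to-bytes table is built lazily,
from the decoder-side strings, the first time an encoder needs it, so
decode-only use never pays for it.

import java.util.Arrays;

class LazyC2BSketch {
    private static final char UNMAPPABLE = '\uFFFD';
    private static volatile char[] c2b;   // index: char, value: (b1 << 8 | b2); double-byte mappings only

    // b2cStr and b2Min would come from the generated charset class
    static char[] c2b(String[] b2cStr, int b2Min) {
        char[] local = c2b;
        if (local == null) {
            synchronized (LazyC2BSketch.class) {
                if ((local = c2b) == null) {
                    local = new char[0x10000];
                    Arrays.fill(local, UNMAPPABLE);
                    for (int b1 = 0; b1 < 0x100; b1++) {
                        String page = b2cStr[b1];
                        if (page == null)
                            continue;
                        for (int i = 0; i < page.length(); i++) {
                            char c = page.charAt(i);
                            if (c != UNMAPPABLE)
                                local[c] = (char) ((b1 << 8) | (b2Min + i));
                        }
                    }
                    c2b = local;
                }
            }
        }
        return local;
    }
}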
(7) Avoid copying b2cStr to b2c:
(String#charAt() is as fast as char[] access)
Benefit[7]: increase startup performance for decoder
I tried again last night. char[][] is much faster than the
String[] version in both the client
and server VM. So keep it as-is. (This is actually why I switched from
String[] to char[][].)
(9) Share mappings (IBM930 and IBM939 are 99 % identical):
Benefit[9]: save up to 99 % superfluous disk-footprint
Benefit[10]: save up to 99 % superfluous memory-footprint (if
both charsets are loaded)
I still want to keep one map for one charset, at least for now. It
might be something worth researching
in the future.
(12) Get rid of sun.io package dependency:
https://java-nio-charset-enhanced.dev.java.net/source/browse/java-nio-charset-enhanced/tags/milestone2/src/sun/io/
Benefit[13]: avoid superfluous disk-footprint
Benefit[14]: save maintenance of sun.io converters
Disadvantage[1]: published under JRL (waiting for launch of
OpenJDK-7 project "charset-enhancement") ;-)
This is not something about engineering. It's about license,
policy...
(17) Decoder#decodeArrayLoop: shortcut for single byte only:
int sr = src.remaining();
int sl = sp + sr;
int dr = dst.remaining();
int dl = dp + dr;
// single byte only loop
int slSB = sp + (sr < dr ? sr : dr);
while (sp < slSB) {
char c = b2cSB[sa[sp] & 0xff];
if (c == UNMAPPABLE_DECODING)
break;
da[dp++] = c;
sp++;
}
Same for Encoder#encodeArrayLoop
(18) Decoder_EBCDIC: boolean singlebyteState:
if (singlebyteState)
...
(19) Decoder_EBCDIC: decode single byte first:
if (singlebyteState)
c = b2cSB[b1];
if (c == UNMAPPABLE_DECODING) {
...
}
Benefit[20]: should be faster
Unlike when we are dealing with single-byte charsets: for double-byte
charsets
the priority should be given to double-byte codepoints, if
possible, not to single-byte
codepoints.
*** Encoder-Suggestions:
(21) join *.nr to *.c2b files (25->000a becomes 000a->fffd):
Benefit[21]: reduce no. of files
Benefit[22]: simplifies initC2B() (avoids 2 loops)
In theory you can do some magic to "join" .nr into .c2b. The price
might be more complicated
logic that depends on the codepoints; you may end up doing some table
lookup for each codepoint
in b2c when processing.
And big thanks for all the suggestions.