Re: Rewrite of IBM doublebyte charsets

Xueming Shen Mon, 11 May 2009 23:30:42 -0700

Ulf, thanks for taking time reviewing the changes.

It seems most of the suggestions are based on two approaches you adoptedin your charset-improvement project

(1)extract the mapping data into a binary data file and load in the dataas the resource file during initialization of

charset decoder/encoder

(2)totally depends on catching the IndexOutOfBoundsException to avoidsanity check on any array index when access

the array data

For (1), as I explained last time, the 30% more size saving isattractive, but the magnitude of the charset startup/initialization regression (reading data from the resource inputstream ismuch slower than loading in the data as a constantString object of a class file) is not acceptable. I do have a version ofimplementation that goes that approach, I evensent the webrev to Alan and Masayoshi for review, I "recalled" thereview request after realized the startup regression.That was 6 months ago (I uploaded the webrev athttp://cr.openjdk.java.net/~sherman/Old_newEUC_TW.)

So for now it's not a choice for me to extract the data out.

For (2), I'm not convinced that this approach is an appropriate one fora complicated charset like EUC_TW,given the number of array it carries, the recovery work (to trace backto what goes wrong and then return theappropriate CoderResult) will be complicated and redundant...). Thismight have a benefit of saving the rangecheck (I don't have any data to show how much we can gain from doingthis, only a guess), but given almost allsegments are near "full", I don't see the benefit on the footprintsaving side. We need some hard data to supportthis approach, which I don't have for now. I would leave this one foryou for further optimization in your project.

I have updated the webrev to address some of your other optimizationsuggestions

(1)simplify the "plane number" byte check by adding a new static arrayof cnspToIndex[16] for decoder(2)only do the isLeagalDB check when lookup failed (assume most bytesfed are legal...) for decoder(3)save plane no as 0x0, 0x2...0x7 and 0xf for encoder (this helps savethe plane number check during

encoding).

Something I don't think I'm going to do

(1)No I don't think we want to save the supplementary into surrogatepair, this is what I'm trying to fix. We don'tcare the performance of surrogates, those codepoints are RARE used, 99%+coding/decoding happens inBMP, we did not have the supplementary characters for the first coupleyears. (OK, I'm a native, I don't think

I can even read those characters)

(2)The initialization c2b data for encoder has already been "lazied"until Encoder class gets loaded.

(3)I doubt use BitSet for b2cIsSupp will save any memory, or faster.b2cIsSupp has been fully utilized to

use each/every bit of its storage.

I gave the latest version a "quick" run, I'm happy with the performancegain, especially the encoding.


-client
   Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 84927,10151 :8.366368d
   Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 13162,9237 :1.424922d
   Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 94829,19950 :4.753333d
   Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 21884,17992 :1.216318d

-server

   Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 74651,12218 :6.109920d
   Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 11686,10364 :1.127557d
   Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 79058,12187 :6.487076d
   Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 13518,10198 :1.325554d

The latest webrev is at

http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev/

Thanks,
Sherman


Ulf Zibis wrote:

Completed ...:


*** Decoder-Suggestions:

(10) Split map data files into chunks and load lazy.
   TW native speakers must be consulted, to define reasonable chunks!
   Benefit[17]: save startup time
   Benefit[18]: save memory

(11) Use java.util.BitSet for b2cIsSupp
   Benefit[19]: save memory, maybe faster


*** Encoder-Suggestions:
(21) Initialize encoder mappings lazy, maybe split into reasonablechunks:
   Benefit[21]: increase startup performance for de/encoder

(21) Save c2b and c2bPlane in 2-dimensional array:
     char[][] c2b = new char[0x100][]
     only instantiate actually used segments:
     c2b[x] = new char[0x100]
Benefit[22]: save lookup and calculation of index, but add 1indirectionBenefit[23]: save range-check for segment index (catch malformedsegment index by NPE)
   Benefit[24]: save c2bIndex
(22) In case of surrogate code points, use high surrogate (8 lowerbits) as segment index:
     char[][] c2bSupp = new char[0x100][]
     only instantiate actually used segments:
     c2bSupp[x] = new char[0x400]
Benefit[25]: save encoding to UC4 from surrogate pairs (I guess,this would significantly increase performance)Benefit[26]: save lookup and calculation of index, but add 1indirectionBenefit[27]: save range-check for segment index (catch malformedsegment index by NPE)
   Benefit[28]: save c2bSuppIndex

(23) Truncate c2b segments:
     c2b[x] = new char[usedLength]
(usedLength values could be generated and saved in EUC_TWMappingor data file)Benefit[29]: avoid superfluous memory and disk-footprint (I guess~30 %)Benefit[30]: don't range-check in-segment index, catch unmappableindex by IndexOutOfBoundsException
(24) Additinally truncate leading unmappables in c2b segments, andhost offsets:Benefit[31]: avoid another superfluous memory and disk-footprint (Iguess ~10 %)
   Disadvantage[21]: needs hosting of offsets: 256 bytes
(25) Concerning (23),(24): Check out best segment size (maybe 256 isnot optimal):Benefit[32]: avoid another superfluous memory and disk-footprint (Iguess 10-20 %)
(26) Concerning (22),(23),(24): maybe use 3-dim array and check outbest segment size (maybe 10 bit is not optimal):Benefit[33]: avoid another superfluous memory and disk-footprint (Iguess 10-20 %)
(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
   Benefit[34]: simplify calculation of 2nd byte, increases performance
(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf)instead of Plane no.:
   Benefit[35]: save calculation of 2nd byte, increases performance
   Disadvantage[22]: increases c2bPlane by ~73%

-Ulf


EUC_TW statistics (updated):

Plane   range   length  segments  segments-usage-ratio

0    a1a1-fdcb   5868   5d = 93   66 %
_0   a1a1-a744    434    7 = 7    65 %
_1   c2a1-fdcb   5434   3c = 60   95 %

1:8ea2   -f2c4   7650   52 = 82   98 %
2:8ea3   -e7aa   6394   47 = 71   95 %
3:8ea4   -eedc   7286   4e = 78   98 %
4:8ea5   -fcd1   8601   5c = 92   98 %
5:8ea6   -e4fa   6385   44 = 68   99 %
6:8ea7   -ebd5   6532   4b = 75   98 %
7:8eaf   -edb9   8721   4d = 77   92 %

Sum:             55446  262 = 610

max b1 range: 5d = 93
max b2 range: 5e = 94

memory amount for all segments (not truncated):
610 * 94 = 57,340 code points
truncated -4 % :  ~55,000 code points
decoder surrogate mapping (*3):  ~165,000 bytes

disk-footprint of EUC_TWMapping (1. Approach from Sherman):
b2c           : 8 * 94 * 94 * 2.97 = 209,943 Bytes
b2cIsSuppStr  : 94 * 94 * 1.48     =  13,077
c2bIndex      : 256 * 7            =   1,792
c2bSuppIndex  : 256 * 7            =   1,792
Sum                                 ~227,000 Bytes

memory of EUC_TW (1. Approach from Sherman):
b2c           : 8 * 94 * 94 * 2 = 141,376 Bytes
b2cIsSupp     : 94 * 94         =   8,836
decoder sum   :                   150,212
c2b           : 31744 * 2       =  63,488
c2bIndex      : 256 * 2         =     512
c2bSupp       : 43520 * 2       =  87,040
c2bSuppIndex  : 256 * 2         =     512
c2bPlane      : 43520 * 1       =  43,520
encoder sum   :                   195,072

Re: Rewrite of IBM doublebyte charsets

Reply via email to