Ulf, thanks for taking time reviewing the changes.

It seems most of the suggestions are based on two approaches you adopted in your charset-improvement project

(1)extract the mapping data into a binary data file and load in the data as the resource file during initialization of
charset decoder/encoder

(2)totally depends on catching the IndexOutOfBoundsException to avoid sanity check on any array index when access
the array data

For (1), as I explained last time, the 30% more size saving is attractive, but the magnitude of the charset startup/ initialization regression (reading data from the resource inputstream is much slower than loading in the data as a constant String object of a class file) is not acceptable. I do have a version of implementation that goes that approach, I even sent the webrev to Alan and Masayoshi for review, I "recalled" the review request after realized the startup regression. That was 6 months ago (I uploaded the webrev at http://cr.openjdk.java.net/~sherman/Old_newEUC_TW.)
So for now it's not a choice for me to extract the data out.

For (2), I'm not convinced that this approach is an appropriate one for a complicated charset like EUC_TW, given the number of array it carries, the recovery work (to trace back to what goes wrong and then return the appropriate CoderResult) will be complicated and redundant...). This might have a benefit of saving the range check (I don't have any data to show how much we can gain from doing this, only a guess), but given almost all segments are near "full", I don't see the benefit on the footprint saving side. We need some hard data to support this approach, which I don't have for now. I would leave this one for you for further optimization in your project.

I have updated the webrev to address some of your other optimization suggestions

(1)simplify the "plane number" byte check by adding a new static array of cnspToIndex[16] for decoder (2)only do the isLeagalDB check when lookup failed (assume most bytes fed are legal...) for decoder (3)save plane no as 0x0, 0x2...0x7 and 0xf for encoder (this helps save the plane number check during
encoding).

Something I don't think I'm going to do

(1)No I don't think we want to save the supplementary into surrogate pair, this is what I'm trying to fix. We don't care the performance of surrogates, those codepoints are RARE used, 99%+ coding/decoding happens in BMP, we did not have the supplementary characters for the first couple years. (OK, I'm a native, I don't think
I can even read those characters)

(2)The initialization c2b data for encoder has already been "lazied" until Encoder class gets loaded.

(3)I doubt use BitSet for b2cIsSupp will save any memory, or faster. b2cIsSupp has been fully utilized to
use each/every bit of its storage.


I gave the latest version a "quick" run, I'm happy with the performance gain, especially the encoding.

-client
   Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 84927,10151 :8.366368d
   Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 13162,9237 :1.424922d
   Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 94829,19950 :4.753333d
   Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 21884,17992 :1.216318d

-server

   Encoding TimeRatio EUC_TW_OLD/x-EUC-TW: 74651,12218 :6.109920d
   Decoding TimeRatio EUC_TW_OLD/x-EUC-TW: 11686,10364 :1.127557d
   Encoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 79058,12187 :6.487076d
   Decoding(dir) TimeRatio EUC_TW_OLD/x-EUC-TW: 13518,10198 :1.325554d

The latest webrev is at

http://cr.openjdk.java.net/~sherman/6831794_6229811/webrev/

Thanks,
Sherman


Ulf Zibis wrote:
Completed ...:


*** Decoder-Suggestions:

(10) Split map data files into chunks and load lazy.
   TW native speakers must be consulted, to define reasonable chunks!
   Benefit[17]: save startup time
   Benefit[18]: save memory

(11) Use java.util.BitSet for b2cIsSupp
   Benefit[19]: save memory, maybe faster


*** Encoder-Suggestions:

(21) Initialize encoder mappings lazy, maybe split into reasonable chunks:
   Benefit[21]: increase startup performance for de/encoder

(21) Save c2b and c2bPlane in 2-dimensional array:
     char[][] c2b = new char[0x100][]
     only instantiate actually used segments:
     c2b[x] = new char[0x100]
Benefit[22]: save lookup and calculation of index, but add 1 indirection Benefit[23]: save range-check for segment index (catch malformed segment index by NPE)
   Benefit[24]: save c2bIndex

(22) In case of surrogate code points, use high surrogate (8 lower bits) as segment index:
     char[][] c2bSupp = new char[0x100][]
     only instantiate actually used segments:
     c2bSupp[x] = new char[0x400]
Benefit[25]: save encoding to UC4 from surrogate pairs (I guess, this would significantly increase performance) Benefit[26]: save lookup and calculation of index, but add 1 indirection Benefit[27]: save range-check for segment index (catch malformed segment index by NPE)
   Benefit[28]: save c2bSuppIndex

(23) Truncate c2b segments:
     c2b[x] = new char[usedLength]
(usedLength values could be generated and saved in EUC_TWMapping or data file) Benefit[29]: avoid superfluous memory and disk-footprint (I guess ~30 %) Benefit[30]: don't range-check in-segment index, catch unmappable index by IndexOutOfBoundsException

(24) Additinally truncate leading unmappables in c2b segments, and host offsets: Benefit[31]: avoid another superfluous memory and disk-footprint (I guess ~10 %)
   Disadvantage[21]: needs hosting of offsets: 256 bytes

(25) Concerning (23),(24): Check out best segment size (maybe 256 is not optimal): Benefit[32]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)

(26) Concerning (22),(23),(24): maybe use 3-dim array and check out best segment size (maybe 10 bit is not optimal): Benefit[33]: avoid another superfluous memory and disk-footprint (I guess 10-20 %)

(27) Save Plane no. as 0x0, 0x2 .. 0x7 and 0xf:
   Benefit[34]: simplify calculation of 2nd byte, increases performance

(28) Save 2nd byte in c2bPlane directly (0xa2 .. 0xa7 and 0xaf) instead of Plane no.:
   Benefit[35]: save calculation of 2nd byte, increases performance
   Disadvantage[22]: increases c2bPlane by ~73%

-Ulf


EUC_TW statistics (updated):

Plane   range   length  segments  segments-usage-ratio

0    a1a1-fdcb   5868   5d = 93   66 %
_0   a1a1-a744    434    7 = 7    65 %
_1   c2a1-fdcb   5434   3c = 60   95 %

1:8ea2   -f2c4   7650   52 = 82   98 %
2:8ea3   -e7aa   6394   47 = 71   95 %
3:8ea4   -eedc   7286   4e = 78   98 %
4:8ea5   -fcd1   8601   5c = 92   98 %
5:8ea6   -e4fa   6385   44 = 68   99 %
6:8ea7   -ebd5   6532   4b = 75   98 %
7:8eaf   -edb9   8721   4d = 77   92 %

Sum:             55446  262 = 610

max b1 range: 5d = 93
max b2 range: 5e = 94

memory amount for all segments (not truncated):
610 * 94 = 57,340 code points
truncated -4 % :  ~55,000 code points
decoder surrogate mapping (*3):  ~165,000 bytes

disk-footprint of EUC_TWMapping (1. Approach from Sherman):
b2c           : 8 * 94 * 94 * 2.97 = 209,943 Bytes
b2cIsSuppStr  : 94 * 94 * 1.48     =  13,077
c2bIndex      : 256 * 7            =   1,792
c2bSuppIndex  : 256 * 7            =   1,792
Sum                                 ~227,000 Bytes

memory of EUC_TW (1. Approach from Sherman):
b2c           : 8 * 94 * 94 * 2 = 141,376 Bytes
b2cIsSupp     : 94 * 94         =   8,836
decoder sum   :                   150,212
c2b           : 31744 * 2       =  63,488
c2bIndex      : 256 * 2         =     512
c2bSupp       : 43520 * 2       =  87,040
c2bSuppIndex  : 256 * 2         =     512
c2bPlane      : 43520 * 1       =  43,520
encoder sum   :                   195,072



Reply via email to