Re: Rewrite of IBM doublebyte charsets

Ulf Zibis Thu, 14 May 2009 13:13:50 -0700

Now I have time to answer in more detail ...

On 12.05.2009 08:30, Xueming Shen wrote:

For (2), I'm not convinced that this approach is an appropriate one for a complicated charset like EUC_TW, given the number of arrays it carries; the recovery work (to trace back to what goes wrong and then return the appropriate CoderResult) will be complicated and redundant.

Well, checking the range twice is also redundant (it is additionally checked behind the scenes on every array access by the JVM).
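
To make the two variants concrete, here is a minimal sketch. The table, its bounds, and the names (RangeCheckSketch, decodeChecked, decodeUnchecked) are invented for illustration and are not taken from the webrev. The first variant tests the range explicitly before the lookup; the second relies on the JVM's own bounds check and recovers in a catch block.

```java
final class RangeCheckSketch {
    // Hypothetical replacement value for unmappable input.
    static final char UNMAPPABLE = (char) 0xFFFD;

    // Hypothetical segment covering second bytes 0x40..0x7E.
    static final int B2_MIN = 0x40;
    static final int B2_MAX = 0x7E;
    static final char[] TABLE = new char[B2_MAX - B2_MIN + 1];

    static {
        // Dummy mapping data, only so the sketch is runnable.
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = (char) (0x4E00 + i);
        }
    }

    // Variant A: explicit range check, then the array access
    // (which the JVM bounds-checks once more).
    static char decodeChecked(int b2) {
        if (b2 < B2_MIN || b2 > B2_MAX) {
            return UNMAPPABLE;
        }
        return TABLE[b2 - B2_MIN];
    }

    // Variant B: no explicit check; an out-of-range index is caught
    // after the JVM's bounds check has rejected it.
    static char decodeUnchecked(int b2) {
        try {
            return TABLE[b2 - B2_MIN];
        } catch (ArrayIndexOutOfBoundsException e) {
            return UNMAPPABLE;
        }
    }
}
```

Whether variant B is actually faster would have to be measured; with several arrays involved, as in EUC_TW, the catch block also has to work out which lookup failed.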

This might have a benefit of saving the range check (I don't have any data to show how much we can gain from doing this, only a guess), but given almost all segments are near "full", I don't see the benefit on the footprint saving side. We need some hard data to support this approach, which I don't have for now. I would leave this one for you for further optimization in your project.

Yes, that's a good idea. I would be happy if it were launched in the near future ...

I have updated the webrev to address some of your other optimization suggestions


Happy to see that. :-)

(1) No, I don't think we want to save the supplementary into surrogate pair, this is what I'm trying to fix. We don't care about the performance of surrogates, those codepoints are RARELY used, 99%+ coding/decoding happens in BMP, we did not have the supplementary characters for the first couple years. (OK, I'm a native, I don't think I can even read those characters)

This is what I didn't know. My assumption was that those supplementary characters would be regularly used, as they are 137 % of the BMP chars count. But if they are so rarely used, wouldn't it be reasonable to split the mapping into 2 chunks, or even 3 chunks, having a base chunk of about 10 % of BMP? Your native status would help to discover those ~10 %. ;-)

Well, such an optimization would ideally be placed in the mentioned project.
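
As a rough illustration of what splitting the mapping could look like, here is a sketch with a BMP chunk on the fast path and a separate chunk holding supplementary code points as int values instead of surrogate pairs. All names and table contents are hypothetical; only Character.isSupplementaryCodePoint and Character.toChars are real JDK calls.

```java
final class SplitMappingSketch {
    // Marker for "no BMP mapping at this index".
    static final char NONE = (char) 0xFFFD;

    // Hypothetical BMP chunk: consulted first for every lookup.
    static final char[] BMP = { (char) 0x4E00, (char) 0x4E01, NONE };

    // Hypothetical supplementary chunk: code points stored as int,
    // consulted only when the BMP chunk has no mapping.
    static final int[] SUPP = { 0, 0, 0x20000 };

    // Writes the decoded char(s) into dst and returns how many
    // chars were written (0 means unmappable).
    static int decode(int index, char[] dst) {
        char c = BMP[index];
        if (c != NONE) {
            dst[0] = c;          // common case: one BMP char
            return 1;
        }
        int cp = SUPP[index];
        if (Character.isSupplementaryCodePoint(cp)) {
            // Rare case: build the surrogate pair on demand.
            return Character.toChars(cp, dst, 0);
        }
        return 0;
    }
}
```

In this arrangement the surrogate pair is computed only when a supplementary character is actually hit, so the rare path costs nothing on the BMP path beyond one comparison.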

(2) The initialization of c2b data for the encoder has already been "lazied" until the Encoder class gets loaded.


Oops, I overlooked this fact. ;-)
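
For the record, a minimal sketch of how such laziness falls out of Java class initialization: a static table in a nested class is only built when that class is first used. The class names, the tiny tables, and the initCount counter below are invented for illustration and do not reflect the actual webrev code.

```java
final class LazyC2bSketch {
    // Counts how often the c2b table was built (illustration only).
    static int initCount = 0;

    // Hypothetical b2c data, available as soon as the charset is used.
    static final char[] B2C = { 'A', 'B', 'C' };

    static final class Decoder {
        static char decode(int b) {
            return B2C[b];
        }
    }

    static final class Encoder {
        // Built in Encoder's static initializer, which runs only
        // when Encoder is first used, not when Decoder is.
        static final int[] C2B = buildC2b();

        static int[] buildC2b() {
            initCount++;
            int[] c2b = new int[128];
            java.util.Arrays.fill(c2b, -1);   // -1 marks "unmappable"
            for (int b = 0; b < B2C.length; b++) {
                c2b[B2C[b]] = b;              // invert the b2c table
            }
            return c2b;
        }

        static int encode(char c) {
            return c < C2B.length ? C2B[c] : -1;
        }
    }
}
```

Decoding alone never touches Encoder, so the inverted table is not built until the first encode call.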


-Ulf
