Now I have time to answer in more detail ...
On 12.05.2009 08:30, Xueming Shen wrote:
For (2), I'm not convinced that this approach is an appropriate one
for a complicated charset like EUC_TW.
Given the number of arrays it carries, the recovery work (to trace back
to what went wrong and then return the
appropriate CoderResult) will be complicated and redundant...
Well, checking the range twice is also redundant (it is additionally
checked behind the scenes by the JVM on every array access).
This might have a benefit of saving the range
check (I don't have any data to show how much we can gain from doing
this, only a guess), but given almost all
segments are near "full", I don't see the benefit on the footprint
saving side. We need some hard data to support
this approach, which I don't have for now. I would leave this one for
you for further optimization in your project.
Yes, that's a good idea. I would be happy if it were launched in the
near future ...
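
To make the two positions on (2) concrete, here is a minimal sketch, not
the actual EUC_TW code. The class name, table layout, byte ranges and the
single mapping are all made up for illustration. Variant A does the
explicit range checks; variant B drops them and relies on the bounds
check the JVM performs anyway. Note that the catch block in a real decoder
would still have to work out which CoderResult to return, which is the
recovery work mentioned above.

```java
import java.util.Arrays;

// Hypothetical two-level decode table; sizes and content are illustrative only.
final class RangeCheckSketch {
    static final char UNMAPPABLE = '\uFFFD';
    static final int LEAD_MIN = 0xA1;    // assumed first lead byte
    static final int TRAIL_MIN = 0xA1;   // assumed first trail byte
    static final char[] EMPTY = new char[0];

    // One segment per lead byte 0xA1..0xFE; unused lead bytes share EMPTY.
    static final char[][] B2C = new char[0x5E][];

    static {
        Arrays.fill(B2C, EMPTY);
        char[] seg = new char[0x5E];
        Arrays.fill(seg, UNMAPPABLE);
        seg[0] = '\u4E00';               // made-up mapping for bytes A4 A1
        B2C[0xA4 - LEAD_MIN] = seg;
    }

    // Variant A: explicit range checks before indexing
    // (the JVM checks the bounds again on each array access).
    static char decodeChecked(int b1, int b2) {
        int i = b1 - LEAD_MIN;
        int j = b2 - TRAIL_MIN;
        if (i < 0 || i >= B2C.length) return UNMAPPABLE;
        char[] seg = B2C[i];
        if (j < 0 || j >= seg.length) return UNMAPPABLE;
        return seg[j];
    }

    // Variant B: no explicit checks; recover from the JVM's own bounds check.
    static char decodeUnchecked(int b1, int b2) {
        try {
            return B2C[b1 - LEAD_MIN][b2 - TRAIL_MIN];
        } catch (ArrayIndexOutOfBoundsException e) {
            // A real decoder would have to decide here between
            // "malformed" and "unmappable" input.
            return UNMAPPABLE;
        }
    }
}
```

Which variant is faster is exactly the open question; the sketch only
shows that both return the same results, it says nothing about timing.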
I have updated the webrev to address some of your other optimization
suggestions
Happy to see that. :-)
(1) No, I don't think we want to store the supplementary characters as
surrogate pairs; this is what I'm trying to fix. We don't
care about the performance of surrogates. Those code points are RARELY used,
99%+ of encoding/decoding happens in the
BMP, and we did not have the supplementary characters for the first couple
of years. (OK, I'm a native speaker, and I don't think
I can even read those characters.)
This is what I didn't know. My assumption was that those supplementary
characters would be used regularly, since their count is 137 % of the
BMP character count.
But if they are so rarely used, wouldn't it be reasonable to split the
mapping into 2 chunks, or even 3 chunks, with a base chunk of about
~10 % of the BMP? Your native status would help to discover those ~10 %. ;-)
Well, such an optimization would ideally be placed in the mentioned project.
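
A rough sketch of what such a split could look like, under my own
assumptions: a dense array as the hot chunk for BMP targets and a separate,
rarely touched structure for the supplementary code points. The class name,
indices and mappings are hypothetical and have nothing to do with the real
EUC_TW tables. The surrogate pair is only produced at the very end via
Character.toChars, so it is never stored in the table itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical split mapping: hot BMP chunk plus cold supplementary chunk.
final class SplitMappingSketch {
    // Hot chunk: dense array for BMP targets (tiny, made-up content).
    static final char[] BMP = { '\u4E00', '\u4E01' };

    // Cold chunk: rarely used supplementary targets, keyed by table index.
    static final Map<Integer, Integer> SUPP = new HashMap<>();

    static {
        SUPP.put(2, 0x20000);            // made-up mapping to U+20000
    }

    // Returns the decoded code point, or -1 if the index is unmapped.
    static int decode(int index) {
        if (index >= 0 && index < BMP.length) {
            return BMP[index];           // fast path, taken almost always
        }
        Integer cp = SUPP.get(index);    // slow path for the rare cases
        return cp == null ? -1 : cp;
    }

    // Converts a code point to UTF-16: one char for BMP, two for supplementary.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }
}
```

For example, toUtf16(decode(2)) yields the two chars D840 DC00, while
toUtf16(decode(0)) yields the single char 4E00.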
(2) The initialization of the c2b data for the encoder has already been
made lazy: it is deferred until the Encoder class gets loaded.
Oops, I overlooked this fact. ;-)
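
For the record, the effect can be illustrated with a nested class whose
static initializer builds the reverse table. This is only a sketch with
hypothetical names and a two-entry table, not the code from the webrev;
the counter exists purely to make the initialization visible.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical charset skeleton: c2b is built only when Encoder is first used.
final class LazyC2bSketch {
    static final char[] B2C = { '\u4E00', '\u4E01' };  // made-up decode table
    static int c2bBuilds = 0;                           // counts c2b constructions

    // Decoding touches only B2C and never triggers the c2b build.
    static char decode(int index) {
        return B2C[index];
    }

    static final class Encoder {
        // Runs once, when the JVM initializes Encoder on its first use.
        static final Map<Character, Integer> C2B = build();

        private static Map<Character, Integer> build() {
            c2bBuilds++;
            Map<Character, Integer> m = new HashMap<>();
            for (int i = 0; i < B2C.length; i++) {
                m.put(B2C[i], i);
            }
            return m;
        }

        // Returns the table index for c, or -1 if c is unmappable.
        static int encode(char c) {
            Integer index = C2B.get(c);
            return index == null ? -1 : index;
        }
    }
}
```

A program that only decodes never pays for the c2b table; the first call
to Encoder.encode builds it exactly once.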
-Ulf