Re: Unicode script support in Regex and Character class

Ulf Zibis Thu, 29 Apr 2010 11:09:16 -0700

On 24.04.2010 01:09, Xueming Shen wrote:

Yes, the final table takes about 500k; we might consider using a weakref or something if memory is really a concern. But the table will get initialized only if you invoke Character.getName(),


Sherman, how did you compute that value? By my count:
- A Map.Entry object counts 24 bytes (40 on a 64-bit machine)
- An Integer object for the key counts 12 bytes (20 on a 64-bit machine)
- A String object counts 36 + 2*length, so for an average character name length of 24: 84 bytes (98 on a 64-bit machine)

--> one character name in a HashMap would count, including bucket overhead, ~135 bytes (~170 on a 64-bit machine)

--> 20,000 character names would count ~2.7 MByte (~3.4 on a 64-bit machine)
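
To make the arithmetic above easy to re-check, here is a minimal sketch. The per-object sizes are simply the figures quoted above, not measurements from a particular JVM, and the class and method names are made up for illustration.

```java
// Back-of-the-envelope estimate only. All sizes are the figures from the
// mail; nothing here is measured. Class and method names are hypothetical.
final class HashMapNameEstimate {

    /** Estimated bytes for a String of the given length: 36 + 2*length. */
    static int stringBytes32(int length) {
        return 36 + 2 * length;
    }

    /** Sum of the 32-bit figures for one map entry, before bucket overhead. */
    static int entryBytes32(int avgNameLength) {
        int mapEntry = 24;   // Map.Entry object
        int integerKey = 12; // Integer object used as key
        return mapEntry + integerKey + stringBytes32(avgNameLength);
    }

    /** Total bytes for a number of names at a given per-name cost. */
    static long totalBytes(int names, int bytesPerName) {
        return (long) names * bytesPerName;
    }

    public static void main(String[] args) {
        System.out.println(stringBytes32(24));       // 84
        System.out.println(entryBytes32(24));        // 120, rounded up to ~135 with buckets
        System.out.println(totalBytes(20_000, 135)); // 2700000
        System.out.println(totalBytes(20_000, 170)); // 3400000
    }
}
```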


See my new version in attachment.

I estimate:
- for byte[] names: 480,000 bytes
- for int[][] indexes:
-- base array size with 4353 elements: 17,420 bytes
-- one int[] index for a block with an average length of 32: 140 bytes
-- sum: 626,700 bytes
overall sum: 1,106,700 bytes (pretty enough)
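
The sums above can be reproduced as follows. This is only a sketch: the block count of 4352 is an assumption (0x110000 code points divided into blocks of 256) that happens to reproduce the stated sum, and the names are hypothetical.

```java
// Re-computes the estimate from the figures in the mail.
// BLOCKS = 4352 is an assumption (0x110000 / 256), not stated in the mail.
final class ArrayNameEstimate {
    static final int NAMES_BYTES = 480_000;     // byte[] names
    static final int BASE_ARRAY_BYTES = 17_420; // int[][] base array, 4353 elements
    static final int BLOCK_BYTES = 140;         // one int[] of average length 32
    static final int BLOCKS = 0x110000 / 256;   // 4352 (assumed)

    /** Base array plus all per-block index arrays. */
    static int indexBytes() {
        return BASE_ARRAY_BYTES + BLOCKS * BLOCK_BYTES;
    }

    /** Names plus indexes. */
    static int totalBytes() {
        return NAMES_BYTES + indexBytes();
    }

    public static void main(String[] args) {
        System.out.println(indexBytes()); // 626700
        System.out.println(totalBytes()); // 1106700
    }
}
```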

If the block offset were smaller than 256, I guess it would be even less (at the cost of slightly decreased performance).

- Initializing the indexes array should be *much* faster than filling the hash map.
- Retrieving an index should be a little faster or equivalent, but the instantiation of one new String object must be added.
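
For readers without the attachment, here is a rough sketch of what a byte[] names plus int[][] indexes lookup could look like. It is not the code from CharacterName2.java: the length-prefixed name layout, the "0 means no name" convention, the fixed block size of 256 and all identifiers are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, NOT the attached implementation.
final class NameTableSketch {
    private static final int BLOCK_SHIFT = 8;                     // 256 code points per block
    private static final int BLOCK_MASK = (1 << BLOCK_SHIFT) - 1;

    private final byte[] names;    // ASCII names, each prefixed by one length byte
    private final int[][] indexes; // indexes[block][offset] = position in names, 0 = no name

    private NameTableSketch(byte[] names, int[][] indexes) {
        this.names = names;
        this.indexes = indexes;
    }

    /** Builds the table from parallel arrays; blocks without names stay null. */
    static NameTableSketch build(int[] codePoints, String[] charNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0); // position 0 is reserved to mean "no name"
        int maxBlock = 0;
        for (int cp : codePoints) {
            maxBlock = Math.max(maxBlock, cp >>> BLOCK_SHIFT);
        }
        int[][] indexes = new int[maxBlock + 1][];
        for (int i = 0; i < codePoints.length; i++) {
            int block = codePoints[i] >>> BLOCK_SHIFT;
            if (indexes[block] == null) {
                indexes[block] = new int[BLOCK_MASK + 1];
            }
            byte[] ascii = charNames[i].getBytes(StandardCharsets.US_ASCII);
            indexes[block][codePoints[i] & BLOCK_MASK] = out.size();
            out.write(ascii.length);           // assumes names shorter than 256 bytes
            out.write(ascii, 0, ascii.length);
        }
        return new NameTableSketch(out.toByteArray(), indexes);
    }

    /** Returns the name, or null if none is stored. Allocates one new String per call. */
    String getName(int codePoint) {
        int block = codePoint >>> BLOCK_SHIFT;
        if (codePoint < 0 || block >= indexes.length || indexes[block] == null) {
            return null;
        }
        int offset = codePoint & BLOCK_MASK;
        if (offset >= indexes[block].length) {
            return null; // would matter if blocks were trimmed to their used length
        }
        int pos = indexes[block][offset];
        if (pos == 0) {
            return null;
        }
        int length = names[pos] & 0xFF;
        return new String(names, pos + 1, length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        NameTableSketch t = build(new int[] {0x41, 0x20AC},
                new String[] {"LATIN CAPITAL LETTER A", "EURO SIGN"});
        System.out.println(t.getName(0x41));   // LATIN CAPITAL LETTER A
        System.out.println(t.getName(0x20AC)); // EURO SIGN
        System.out.println(t.getName(0x42));   // null
    }
}
```

Filling the index here is a single pass of array stores, and each lookup pays for exactly one new String, which matches the trade-off described above.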


We could go further:
- separate caches (and data files) for the 17 Unicode planes

- calculate short 1/2-byte keys for textual words and frequent phrases. I estimate there are 1000..4000 different words and 100..300 redundant phrases in the data.
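
As a rough illustration of the word-key idea, the sketch below replaces each space-separated word of a name with a key into a shared word list. It is a simplification under stated assumptions: it always uses 2-byte keys (no 1-byte keys for the most frequent words), it ignores phrases, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of word-level dictionary coding for character names.
// Assumes fewer than 65536 distinct words, so a key fits into a short.
final class WordDictionarySketch {
    private final Map<String, Integer> keyOf = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    /** Encodes a name as one key per word, adding unseen words to the dictionary. */
    short[] encode(String name) {
        String[] parts = name.split(" ");
        short[] keys = new short[parts.length];
        for (int i = 0; i < parts.length; i++) {
            Integer key = keyOf.get(parts[i]);
            if (key == null) {
                key = words.size();
                keyOf.put(parts[i], key);
                words.add(parts[i]);
            }
            keys[i] = key.shortValue();
        }
        return keys;
    }

    /** Rebuilds the name by looking up each key and joining with spaces. */
    String decode(short[] keys) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < keys.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(words.get(keys[i] & 0xFFFF));
        }
        return sb.toString();
    }

    int distinctWords() {
        return words.size();
    }

    public static void main(String[] args) {
        WordDictionarySketch dict = new WordDictionarySketch();
        short[] a = dict.encode("LATIN CAPITAL LETTER A");
        short[] b = dict.encode("LATIN SMALL LETTER A");
        System.out.println(a.length + b.length);  // 8 keys = 16 bytes instead of 42 characters
        System.out.println(dict.distinctWords()); // 5
        System.out.println(dict.decode(b));       // LATIN SMALL LETTER A
    }
}
```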

Are you interested in that?

We could add (Attention: CCC change) a cacheCharacterNames(boolean yesNo) method to serve users who need this functionality heavily.
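
One way such a switch could behave, purely as a sketch: by default the table is held through a SoftReference so the VM may reclaim it, and cacheCharacterNames(true) additionally pins it with a strong reference. Only the method name comes from the proposal above; the class, the fields, the demo table and the load counter are hypothetical.

```java
import java.lang.ref.SoftReference;

// Hypothetical sketch; only the method name cacheCharacterNames comes from the mail.
final class NameCacheSketch {
    private static SoftReference<String[]> softTable; // reclaimable under memory pressure
    private static String[] pinnedTable;              // non-null only while caching is requested
    static int loads;                                 // demo counter: how often the table was built

    /** true: keep the table strongly reachable; false: let the VM reclaim it again. */
    static synchronized void cacheCharacterNames(boolean yesNo) {
        pinnedTable = yesNo ? table() : null;
    }

    /** Returns the table, rebuilding it if it was never loaded or has been reclaimed. */
    static synchronized String[] table() {
        if (pinnedTable != null) {
            return pinnedTable;
        }
        String[] t = (softTable == null) ? null : softTable.get();
        if (t == null) {
            t = load();
            softTable = new SoftReference<>(t);
        }
        return t;
    }

    /** Stand-in for reading the real data file. */
    private static String[] load() {
        loads++;
        return new String[] {"demo entry"};
    }

    public static void main(String[] args) {
        cacheCharacterNames(true);
        System.out.println(table() == table()); // true: pinned, same array every time
        System.out.println(loads);              // 1
    }
}
```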


-Ulf

CharacterName2.java
Description: java/
