Re: Unicode script support in Regex and Character class

Ulf Zibis Fri, 23 Apr 2010 19:38:08 -0700

On 24.04.2010 01:09, Xueming Shen wrote:

Ulf Zibis wrote:
- I like the idea of saving the data in a compressed binary file instead of as static data in the class file.
- Wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName?
I don't think so. The key for these 2 cases is the whole Unicode range.


At least the aliases map has string keys.

But you can always try. The current binary search for script is definitely not a perfect solution.

In most cases you won't get an exact match from the HashMap of CharacterName, so you have to do the binary search anyway.

- As an alternative to a lookup in a hash table, I guess retrieving the pointers from a memory-saving sorted array via binary search would be fast enough.
- j.l.CharacterName:
-- You could instantiate the HashMap with capacity=cpLen
I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (the last version was 122+k), but


Is this the compressed or the uncompressed size?

then I can no longer use cpLen as the capacity for the HashMap. I'm now using a hardcoded 20000 for 5.2.

You could pre-calculate the actual value with the help of generatecharacter/CharacterName.java
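
A minimal sketch of the pre-calculation, assuming the generator tool can count the named code points at build time and emit that count. The class and method names are hypothetical; the only JDK fact used is HashMap's default load factor of 0.75.

```java
import java.util.HashMap;
import java.util.Map;

class CapacityHint {
    // Smallest initial capacity that lets a HashMap hold `entries` mappings
    // without resizing at the default load factor of 0.75.
    static int capacityFor(int entries) {
        return (int) (entries / 0.75f) + 1;
    }

    // Hypothetical: `entries` would be a constant written out by the
    // generator instead of a hardcoded guess.
    static Map<Integer, String> newNameMap(int entries) {
        return new HashMap<>(capacityFor(entries));
    }
}
```

With such a count the map never rehashes during initialization and does not over-allocate either.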

I believe you would NOT see any meaningful performance boost from using DirectByteBuffer, given the
size of the data file, 88k. It would probably slow it down a little.

If you read the whole file, yes, but what about retrieving a single datum from a distinct position?

If you take a look at the last version
http://cr.openjdk.java.net/~sherman/script/webrev/src/share/classes/java/lang/CharacterName.java.html
You probably will not consider using the DataInputStream class. I no longer store the code point value for most entries, only the length of the name, for which 1 byte is definitely big enough.


You could save one more byte:

    do {
        int len = ba[off++] & 0xff;
        if (len < 0x11) {
            // always big-endian
            cp = (len << 16) |
                 ((ba[off++] & 0xff) << 8) |
                 ((ba[off++] & 0xff));
            len = ba[off++] & 0xff;
        } else {
            len -= 0x11;
            cp++;
        }
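
For anyone following along without the webrev open, here is a self-contained sketch of the record format as I read it from the excerpt: a first byte below 0x11 is the plane byte of an explicit 3-byte code point followed by a length byte, anything else is the name length plus 0x11 for the next consecutive code point. The encoder, the assumption that the name bytes follow each header directly, the ASCII encoding, and the class name are mine and not taken from the webrev.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class NameTableSketch {
    // Encodes a code point -> name map in ascending code point order.
    // Assumes ASCII names of at most 238 bytes (255 - 0x11).
    static byte[] encode(TreeMap<Integer, String> names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = -2; // guarantees the first entry is written explicitly
        for (Map.Entry<Integer, String> e : names.entrySet()) {
            int cp = e.getKey();
            byte[] name = e.getValue().getBytes(StandardCharsets.US_ASCII);
            if (cp == prev + 1) {
                // consecutive code point: one header byte, length + 0x11
                out.write(name.length + 0x11);
            } else {
                // explicit code point, big-endian, then the plain length
                out.write(cp >>> 16);
                out.write((cp >>> 8) & 0xff);
                out.write(cp & 0xff);
                out.write(name.length);
            }
            out.write(name, 0, name.length);
            prev = cp;
        }
        return out.toByteArray();
    }

    // Mirrors the branch structure of the quoted loop.
    static Map<Integer, String> decode(byte[] ba) {
        Map<Integer, String> names = new LinkedHashMap<>();
        int off = 0;
        int cp = 0;
        while (off < ba.length) {
            int len = ba[off++] & 0xff;
            if (len < 0x11) {
                cp = (len << 16)
                   | ((ba[off++] & 0xff) << 8)
                   | (ba[off++] & 0xff);
                len = ba[off++] & 0xff;
            } else {
                len -= 0x11;
                cp++;
            }
            names.put(cp, new String(ba, off, len, StandardCharsets.US_ASCII));
            off += len;
        }
        return names;
    }
}
```

In this sketch a run of consecutive code points costs one header byte per entry, while each jump costs four.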

Yes, the final table takes about 500k; we might consider using a weakref or something if memory is really a concern. But the table will get initialized only if you invoke Character.getName(),

Yes, retrieving one single Character.getName() would cause the whole map to initialize. Is that economical?
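
To make the weakref idea concrete, a rough sketch of a lazily built table held through a reference the GC may clear. I use a SoftReference here rather than a WeakReference, because a weakly held table could be dropped at the very next collection; that choice, the class name, the one-entry stand-in table, and the `loads` counter are all my own assumptions and not from the webrev.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

class LazyNameTable {
    // Cleared by the GC under memory pressure; rebuilt on the next lookup.
    private static SoftReference<Map<Integer, String>> ref;

    // Counts how often the table was built (for illustration only).
    static int loads = 0;

    // Stand-in for decoding the real data file.
    private static Map<Integer, String> load() {
        loads++;
        Map<Integer, String> m = new HashMap<>();
        m.put(0x41, "LATIN CAPITAL LETTER A");
        return m;
    }

    static synchronized String get(int cp) {
        Map<Integer, String> m = (ref == null) ? null : ref.get();
        if (m == null) {
            m = load();
            ref = new SoftReference<>(m);
        }
        return m.get(cp);
    }
}
```

This keeps the first call expensive but lets the 500k go away again when memory gets tight.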

I would expect most applications would never get down there.
(1) to use enum for the j.l.Character.UnicodeScript (compared to the traditional j.l.c.Subset)
- enum j.l.Character.UnicodeScript:
-- IIRC, enums internally are handled as int constants, so retrieving an element via name would need a name->int lookup
-- So UnicodeScript.forName would have to look up 2 times
--- alias->fullName (name of enum element)
--- fullName->internal int constant
-- I suggest adding the full names to the aliases map and looking up only once.
Not really. It's not alias->fullName, it's alias->UnicodeScript constant. So if the name passed in is an alias, then
we don't do the second lookup.

That is what I wanted to say; sorry for not being more detailed.

That said, it's always a trade-off between memory use and speed. Putting all
full names in the aliases map will definitely avoid the second lookup if the name passed in is a canonical name, at the price of having name entries in both the alias map and the enum's internal hashmap.


~100 * (4 + 4) bytes against the above 500,000 bytes; does that matter?
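
A minimal sketch of the single-lookup variant, assuming a cut-down enum with three scripts. The enum name, the map name, and the case normalization are hypothetical; the four-letter aliases are the ISO 15924 codes.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

enum ScriptSketch {
    LATIN, GREEK, CYRILLIC;

    // One map holding both the aliases and the canonical names.
    private static final Map<String, ScriptSketch> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("LATN", LATIN);
        BY_NAME.put("GREK", GREEK);
        BY_NAME.put("CYRL", CYRILLIC);
        // The extra entries discussed above: one per enum constant.
        for (ScriptSketch s : values()) {
            BY_NAME.put(s.name(), s);
        }
    }

    // A single hash lookup, whether an alias or a full name is passed in.
    static ScriptSketch forName(String name) {
        ScriptSketch s = BY_NAME.get(name.toUpperCase(Locale.ENGLISH));
        if (s == null) {
            throw new IllegalArgumentException(name);
        }
        return s;
    }
}
```

The cost is one additional map entry per script, which is the memory figure estimated above.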

I really don't know which
one is the better choice. I did it this way with the assumption that the lookup for script name is not critical. I
might be wrong.
-- Why don't you use Arrays.binarySearch in UnicodeScript.of(int codePoint)?
Why? I don't know :-) Maybe the copy/paste from the UnicodeBlock lookup was more convenient than using
Arrays.binarySearch. Not a big deal.


So both could use Arrays.binarySearch ;-)
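
A sketch of what I mean, assuming the script data is kept as a sorted array of range start points with a parallel array of values. The table below is toy data, not the real Unicode script assignment, and the class and field names are hypothetical.

```java
import java.util.Arrays;

class ScriptLookupSketch {
    // STARTS[i] is the first code point of a run; the run extends to
    // STARTS[i + 1] - 1. Toy data for illustration only.
    static final int[] STARTS =
        {0x0000, 0x0041, 0x005B, 0x0061, 0x007B, 0x0370};
    static final String[] SCRIPTS =
        {"COMMON", "LATIN", "COMMON", "LATIN", "COMMON", "GREEK"};

    static String of(int codePoint) {
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException();
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // Not a range start: binarySearch returned
            // -(insertionPoint) - 1, and the containing run is the
            // one just before the insertion point.
            i = -i - 2;
        }
        return SCRIPTS[i];
    }
}
```

Since STARTS[0] is 0 and negative input is rejected, the computed index can never drop below 0.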

-Ulf
