Re: Unicode script support in Regex and Character class

Xueming Shen Fri, 23 Apr 2010 16:12:07 -0700

Ulf Zibis wrote:

- I like the idea of saving the data in a compressed binary file instead of classfile static data.
- Wouldn't PreHashMaps be initialized faster than normal HashMaps in j.l.Character.UnicodeScript and j.l.CharacterName?

I don't think so. The key for these 2 cases is the whole Unicode range. But you can always try. The current binary search for script is definitely not a perfect solution.

- As an alternative to lookup in a hash table, I guess retrieving the pointers from a memory-saving sorted array via binary search would be fast enough.
- j.l.CharacterName:
-- You could instantiate the HashMap with capacity=cpLeng

I changed the data file "format" a bit, so now the overall uniName.dat is less than 88k (the last version was 122+k), but I can no longer use cpLen as the capacity for the hashmap. I'm now using a hardcoded 20000 for 5.2.

-- Is it faster to first copy the whole data into a byte[] and then use ByteBuffer.getInt etc., compared to directly using DataInputStream methods?
-- You could create a very long String with the whole data and then use substring for the individual strings, which could share the same backing char[].
-- I don't think it's a good idea to hold the whole data in memory, especially as String objects; additionally, the backing char[]'s occupy twice the space of a byte[].
-- The big new byte[total] and later the huge amount of String objects could result in an OOM error on a small VM heap.
-- As a compromise, you could put the cp->nameOff pointers in a separate, non-compressed data file, hold only this in memory or access it via DirectByteBuffer, and read the string data from a separate file only on request from Character.getName(int codePoint). As an option, a PreHashMap could cache individually loaded strings.
-- Anyway, having DirectByteBuffer access on deflated data would be a performance/footprint gain.

Sorry, I don't think I fully understand your points here.

I believe you would NOT see any meaningful performance boost from using DirectByteBuffer, given the size of the data file, 88k. It would probably slow things down a little.

If you take a look at the last version
http://cr.openjdk.java.net/~sherman/script/webrev/src/share/classes/java/lang/CharacterName.java.html

You probably would not consider using the DataInputStream class. I no longer store the code point value for most entries, only the length of the name, for which 1 byte is definitely big enough.

Yes, the final table takes about 500k; we might consider using a weakref or something if memory really is a concern. But the table gets initialized only if you invoke Character.getName(), and I would expect that most applications would never get down there.
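To make the "weakref or something" idea concrete, here is a minimal sketch of a lazily built table held through a SoftReference, so the collector may reclaim it under memory pressure and it is rebuilt on the next call. This is not the code in the webrev: the class name LazyNameTable, the load() helper, and the two-entry table are hypothetical stand-ins for the real decoding of uniName.dat, and SoftReference is just one possible choice among the reference types.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class LazyNameTable {
    // Softly reachable table: the GC may clear it when memory is tight.
    private static SoftReference<Map<Integer, String>> ref;
    private static int loads = 0;

    // Hypothetical loader; a real one would decode the data file here.
    private static Map<Integer, String> load() {
        loads++;
        Map<Integer, String> m = new HashMap<>();
        m.put(0x41, "LATIN CAPITAL LETTER A");
        m.put(0x61, "LATIN SMALL LETTER A");
        return m;
    }

    // The table is built on first use only, and rebuilt if it was collected.
    public static synchronized String getName(int codePoint) {
        Map<Integer, String> table = (ref == null) ? null : ref.get();
        if (table == null) {
            table = load();
            ref = new SoftReference<>(table);
        }
        return table.get(codePoint);
    }

    // Number of times the table has been (re)built so far.
    public static synchronized int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        System.out.println(getName(0x41));
        System.out.println(getName(0x61));
    }
}
```

Applications that never ask for a name never pay for the table, and those that do can have it reclaimed when the heap is under pressure, at the cost of a reload later.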

(1) to use enum for the j.l.Character.UnicodeScript (compared to the traditional j.l.c.Subset)
- enum j.l.Character.UnicodeScript:
-- IIRC, enums are internally handled as int constants, so retrieving an element via name would need a name->int lookup
-- So UnicodeScript.forName would have to look up 2 times
--- alias->fullName (name of enum element)
--- fullName->internal int constant
-- I suggest adding the full names to the aliases map and looking up only once.

Not really. It's not alias->fullName, it's alias->UnicodeScript constant. So if the name passed in is an alias, we don't do the second lookup. That said, it's always a trade-off between memory use and speed. Putting all full names in the aliases map would definitely avoid the second lookup when the name passed in is a canonical name, at the price of having name entries in both the alias map and the enum's internal hashmap. I really don't know which one is the better choice. I did it this way on the assumption that the lookup for a script name is not critical. I might be wrong.
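A minimal sketch of the lookup order described above, for illustration only: the alias map goes straight to the enum constant, and only a miss falls through to the enum's own name lookup. The class ScriptNameLookup, its three-constant Script enum, and the upper-casing of the key are assumptions made for this example, not the code in the webrev; the aliases shown are the ISO 15924 codes for those scripts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ScriptNameLookup {
    // Hypothetical stand-in for the real enum, reduced to three constants.
    public enum Script { LATIN, GREEK, CYRILLIC }

    // Aliases map directly to enum constants, not to full names.
    private static final Map<String, Script> ALIASES = new HashMap<>();
    static {
        ALIASES.put("LATN", Script.LATIN);
        ALIASES.put("GREK", Script.GREEK);
        ALIASES.put("CYRL", Script.CYRILLIC);
    }

    public static Script forName(String name) {
        String key = name.toUpperCase(Locale.ENGLISH);
        Script s = ALIASES.get(key);   // first lookup: alias -> constant
        if (s != null) {
            return s;                  // alias hit, no second lookup
        }
        // Second lookup only for canonical names; valueOf throws
        // IllegalArgumentException if the name is unknown.
        return Script.valueOf(key);
    }

    public static void main(String[] args) {
        System.out.println(forName("Latn"));
        System.out.println(forName("Greek"));
    }
}
```

Adding the canonical names to ALIASES as well would make every successful call a single map lookup, at the price of storing each name twice, which is the trade-off discussed above.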

-- Why don't you use Arrays.binarySearch in UnicodeScript.of(int codePoint)?

Why? I don't know :-) Maybe the copy/paste from the UnicodeBlock lookup was more convenient than using Arrays.binarySearch. Not a big deal.
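For reference, a range lookup built on Arrays.binarySearch could look roughly like the sketch below. It assumes the data is kept as a sorted array of range start points with a parallel array of values; the class name ScriptRangeLookup and the table itself are hypothetical (a toy table that covers ASCII only and uses "OTHER" as a placeholder for everything above it), not the arrays in the webrev.

```java
import java.util.Arrays;

public class ScriptRangeLookup {
    // Toy data: STARTS[i] is the first code point of range i, which runs up
    // to STARTS[i + 1] - 1. Only ASCII is modelled; "OTHER" is a placeholder.
    private static final int[] STARTS = {0x00, 0x41, 0x5B, 0x61, 0x7B, 0x80};
    private static final String[] SCRIPTS =
        {"COMMON", "LATIN", "COMMON", "LATIN", "COMMON", "OTHER"};

    public static String of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException();
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // Not an exact start point: binarySearch returned
            // -(insertionPoint) - 1, and the containing range is the
            // one just before the insertion point.
            i = -i - 2;
        }
        return SCRIPTS[i];
    }

    public static void main(String[] args) {
        System.out.println(of('A'));
        System.out.println(of(0x1F600));
    }
}
```

Because STARTS[0] is 0 and the argument is validated first, the computed index is never negative.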

Thanks,
-Sherman
