On 24.04.2010 01:09, Xueming Shen wrote:
Ulf Zibis wrote:
- I like the idea of saving the data in a compressed binary file
instead of as static data in the class file.
- Wouldn't PreHashMaps initialize faster than normal HashMaps in
j.l.Character.UnicodeScript and j.l.CharacterName?
I don't think so. The key for these 2 cases is the whole unicode range.
At least the aliases map has string keys.
But you can always try. The current
binary search for script is definitely not a perfect solution.
In most cases you don't get an exact match from the HashMap of
CharacterName, so you have to do the binary search anyway.
- As an alternative to a lookup in a hash table, I guess retrieving the
pointers from a memory-saving sorted array via binary search would be
fast enough.
- j.l.CharacterName:
-- You could instantiate the HashMap with capacity=cpLeng
I changed the data file "format" a bit, so now the overall uniName.dat
is less than 88k (the last version was 122+k), but
Is this the compressed or the uncompressed size?
then I can no longer use cpLen as the capacity for the hashmap. I'm now
using a hardcoded 20000 for 5.2.
You could pre-calculate the actual value with the help of
generatecharacter/CharacterName.java
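A minimal sketch of the capacity arithmetic behind this suggestion, assuming the number of names is known when the data file is generated. The class and method names are hypothetical. With the default load factor of 0.75, a HashMap resizes once its size exceeds capacity * 0.75, so the initial capacity has to be at least expected / 0.75 to avoid rehashing while the table is filled.

```java
import java.util.HashMap;
import java.util.Map;

final class CapacitySketch {
    // Smallest initial capacity that holds `expected` entries without a
    // resize at the default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) (expected / 0.75f) + 1;
    }

    // Hypothetical factory: the generator tool would emit `expected`
    // as a constant next to the data file.
    static Map<Integer, String> newNameMap(int expected) {
        return new HashMap<>(capacityFor(expected));
    }
}
```

HashMap rounds the requested capacity up to the next power of two internally, so the value only needs to be a lower bound.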
I believe you would NOT see any meaningful performance boost from
using DirectByteBuffer, given the
size of the data file, 88k. It probably will slow it down a little.
If you read the whole file, yes, but what about retrieving a single
value from a specific position?
If you take a look at the last version
http://cr.openjdk.java.net/~sherman/script/webrev/src/share/classes/java/lang/CharacterName.java.html
you probably will not consider using the DataInputStream class. I no
longer store the code point value for
most entries, only the length of the name, for which 1 byte is
definitely big enough.
You could save one more byte:
    do {
        int len = ba[off++] & 0xff;
        if (len < 0x11) {
            // always big-endian
            cp = (len << 16) |
                 ((ba[off++] & 0xff) << 8) |
                 ((ba[off++] & 0xff));
            len = ba[off++] & 0xff;
        } else {
            len -= 0x11;
            cp++;
        }
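To make the byte layout in the fragment above concrete, here is a self-contained round-trip sketch. It is an illustration, not the actual uniName.dat format: it assumes each entry's name bytes follow directly after the length, that names are ASCII, and that entries are sorted by code point. The class and method names are hypothetical. A first byte below 0x11 is read as the high byte of an explicit 3-byte code point followed by a length byte; any other first byte is the length plus 0x11 and implies "previous code point + 1".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class NameTableSketch {
    // Encodes (codePoint, name) pairs; the map must iterate in ascending
    // code point order.
    static byte[] encode(Map<Integer, String> names) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = -2; // forces the explicit form for the first entry
        for (Map.Entry<Integer, String> e : names.entrySet()) {
            int cp = e.getKey();
            byte[] name = e.getValue().getBytes(StandardCharsets.US_ASCII);
            if (name.length > 0xff) {
                throw new IllegalArgumentException("name too long: " + e.getValue());
            }
            if (cp == prev + 1 && name.length <= 0xff - 0x11) {
                // short form: one byte carries the length, cp is implied
                out.write(name.length + 0x11);
            } else {
                // explicit form: 3-byte big-endian code point, then length
                out.write(cp >> 16); // always < 0x11 for valid code points
                out.write(cp >> 8);  // write(int) keeps only the low 8 bits
                out.write(cp);
                out.write(name.length);
            }
            out.write(name, 0, name.length);
            prev = cp;
        }
        return out.toByteArray();
    }

    // Decodes with the same branch structure as the quoted fragment.
    static Map<Integer, String> decode(byte[] ba) {
        Map<Integer, String> names = new LinkedHashMap<>();
        int off = 0;
        int cp = 0;
        while (off < ba.length) {
            int len = ba[off++] & 0xff;
            if (len < 0x11) {
                cp = (len << 16)
                   | ((ba[off++] & 0xff) << 8)
                   | (ba[off++] & 0xff);
                len = ba[off++] & 0xff;
            } else {
                len -= 0x11;
                cp++;
            }
            names.put(cp, new String(ba, off, len, StandardCharsets.US_ASCII));
            off += len;
        }
        return names;
    }
}
```

In this sketch a run of consecutive code points costs one header byte per entry instead of four, which is where the saving comes from. The price is that the short form can only express lengths up to 0xff - 0x11.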
Yes, the final table takes about 500k. We might consider using a
weakref or something if memory is really
a concern. But the table will get initialized only if you invoke
Character.getName(),
Yes, retrieving one single Character.getName() would cause the whole map
to initialize. Is that economical?
I would expect most
applications would never get down there.
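A minimal sketch of the "weakref or something" idea, assuming a SoftReference is acceptable: the table is built on the first lookup only, and the garbage collector may reclaim it under memory pressure, after which the next lookup rebuilds it. The class name, the load() helper, and its single sample entry are hypothetical stand-ins for decoding the real data file.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

final class LazyNameTable {
    private static SoftReference<Map<Integer, String>> ref;
    static int loads = 0; // how often the table was built (illustration only)

    // Hypothetical loader standing in for decoding uniName.dat.
    private static Map<Integer, String> load() {
        loads++;
        Map<Integer, String> m = new HashMap<>();
        m.put(0x41, "LATIN CAPITAL LETTER A");
        return m;
    }

    static synchronized String getName(int cp) {
        // Hold a strong reference while we use the table, so it cannot be
        // cleared between the null check and the lookup.
        Map<Integer, String> table = (ref == null) ? null : ref.get();
        if (table == null) {
            table = load();
            ref = new SoftReference<>(table);
        }
        return table.get(cp);
    }
}
```

Applications that never ask for a name pay nothing, and those that ask once do not keep the table pinned forever. It does not address the cost of building the whole table for a single lookup.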
(1) to use enum for the j.l.Character.UnicodeScript (compared to the
traditional j.l.c.Subset)
- enum j.l.Character.UnicodeScript:
-- IIRC, enums internally are handled as int constants, so retrieving
an element via name would need a name->int lookup
-- So UnicodeScript.forName would have to lookup 2 times
--- alias->fullName (name of enum element)
--- fullName->internal int constant
-- I suggest adding the full names to the aliases map and looking up
only once.
Not really. It's not alias->fullName, it's alias->UnicodeScript
constant. So if the name passed in is an alias, then
we don't do the second lookup.
That is what I meant to say; sorry for not being more detailed.
That said, it's always a trade-off between memory use and speed. Putting all
full names in the aliases map will definitely avoid the second lookup if
the name passed in is a canonical name, at
the price of having name entries in both the alias map and the enum's internal
hashmap.
~100 * (4 + 4) bytes against the above 500,000 bytes: does that matter?
I really don't know which
one is a better choice. I did it this way with the assumption the
lookup for script name is not critical. I
might be wrong.
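A minimal sketch of the single-lookup variant discussed here, assuming one map that holds both the aliases and the canonical enum names. The enum, its three constants, and the class name are hypothetical placeholders rather than the real UnicodeScript; the aliases are the ISO 15924 codes for those scripts.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class ScriptLookupSketch {
    // Hypothetical stand-in for j.l.Character.UnicodeScript.
    enum Script { LATIN, GREEK, CYRILLIC }

    private static final Map<String, Script> NAMES = new HashMap<>();
    static {
        // aliases map directly to the enum constant
        NAMES.put("LATN", Script.LATIN);
        NAMES.put("GREK", Script.GREEK);
        NAMES.put("CYRL", Script.CYRILLIC);
        // canonical names go into the same map, so no second lookup
        // through Enum.valueOf is needed
        for (Script s : Script.values()) {
            NAMES.put(s.name(), s);
        }
    }

    static Script forName(String name) {
        Script s = NAMES.get(name.toUpperCase(Locale.ENGLISH));
        if (s == null) {
            throw new IllegalArgumentException("unknown script: " + name);
        }
        return s;
    }
}
```

The extra cost is one map entry per enum constant, which is the memory side of the trade-off mentioned above.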
-- Why don't you use Arrays.binarySearch in UnicodeScript.of(int
codePoint)?
Why? I don't know :-) Maybe copy/paste from the UnicodeBlock lookup was
more convenient than using
Arrays.binarySearch. Not a big deal.
So both could use Arrays.binarySearch ;-)
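For illustration, a range lookup built on Arrays.binarySearch could look like the sketch below. It assumes a sorted array of range start code points with a parallel array of script names; both arrays are a truncated sample covering only Basic Latin, and the class name is hypothetical. When the key is not a range start, binarySearch returns -(insertionPoint) - 1, so the containing range is at index -result - 2.

```java
import java.util.Arrays;

final class RangeLookupSketch {
    // Start of each consecutive range, ascending. Sample data only:
    // everything above 0x7A falls into the last entry here.
    static final int[] STARTS = { 0x0000, 0x0041, 0x005B, 0x0061, 0x007B };
    static final String[] SCRIPTS = { "COMMON", "LATIN", "COMMON", "LATIN", "COMMON" };

    static String of(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("invalid code point: " + codePoint);
        }
        int i = Arrays.binarySearch(STARTS, codePoint);
        if (i < 0) {
            // not a range start: step back from the insertion point
            i = -i - 2;
        }
        return SCRIPTS[i];
    }
}
```

Because STARTS[0] is 0 and invalid code points are rejected first, the computed index is never negative.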
-Ulf