Comments and suggestions are requested, especially if you see any
problems with this design.
I'm currently working on java.lang.Character. It's a rather annoying
class, because it allows one to retrieve detailed information on the
entire Unicode character set (65536 chars) -- things were so much
simpler back in the 1 byte char days.
I've designed an implementation that is compatible with Java 1.2 and
Unicode 2.1.2, and that can be _seamlessly_ upgraded to new versions of
the Unicode specification.
Using all the magic tables which the Unicode Consortium releases, I
have written a Perl script which churns out a fixed-size record
database. Each Unicode character consumes 5 bytes of attribute
information, and character entries are laid out sequentially. The
resulting database of attribute information is 327680 bytes. To find
a particular character's attributes in the database, you just seek to
byte offset (char_value*5). I could make things easier on myself (and
slightly speed things up) by using 7 bytes of data (lowercase and
uppercase/titlecase mappings would each get a full 16 bits), but then
the table would jump to 458752 bytes, and I'm not so sure it's worth
the extra space.
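For anyone who wants to check the arithmetic, here is a minimal Java sketch of the offset and table-size calculation described above. The class and method names are made up for illustration; only the record sizes (5 and 7 bytes) and the 65536-character count come from this design.

```java
public class RecordOffsets {
    // Number of characters in the 16-bit Unicode code space.
    static final int CHAR_COUNT = 65536;

    // Byte offset of a character's record, given a fixed record size.
    static long recordOffset(char c, int recordSize) {
        return (long) c * recordSize;
    }

    // Total size of the table, given a fixed record size.
    static long tableSize(int recordSize) {
        return (long) CHAR_COUNT * recordSize;
    }

    public static void main(String[] args) {
        System.out.println(tableSize(5));          // 327680
        System.out.println(tableSize(7));          // 458752
        System.out.println(recordOffset('A', 5));  // 325
    }
}
```

The 7-byte layout costs 131072 extra bytes (458752 - 327680) in exchange for not having to unpack the case-mapping offsets.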
Here's the format of each record in the database:
C = Category
B = Subset Block
N = Numerical Decimal Value
D = Decimal Digit Value
Z = isDigit
I = isIdentifierIgnorable
S = Selects the meaning of U: 0 = Uppercase mapping, 1 = Titlecase mapping
U = Uppercase/Titlecase mapping (offset from current char)
L = Lowercase mapping (offset from current char)
x = Empty
 xZICCCCC   xDDDDBBB   BBBBSUUU   UUUUUUUL   LLLLLLLL
\________/ \________/ \________/ \________/ \________/
  byte 4     byte 3     byte 2     byte 1     byte 0
Ranges of values in the Unicode 2.1.2 spec:
C = 1..28
B = 1..69
N = 0..10000
D = 0..9
U = -128..300
L = -219..214
Total database size for 65536 characters = 327680 bytes.
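A quick sanity check that the ranges listed above fit the bit widths in the record layout can be sketched in Java as follows. It assumes C, B and D are stored unsigned and U and L as two's-complement signed values; the helper names are invented for this example. N is left out because it has no field in the layout shown.

```java
public class FieldWidths {
    // True if every value in [min, max] fits an unsigned field of 'bits' bits.
    static boolean fitsUnsigned(int min, int max, int bits) {
        return min >= 0 && max <= (1 << bits) - 1;
    }

    // True if every value in [min, max] fits a two's-complement field of 'bits' bits.
    static boolean fitsSigned(int min, int max, int bits) {
        int limit = 1 << (bits - 1);
        return min >= -limit && max <= limit - 1;
    }

    public static void main(String[] args) {
        System.out.println(fitsUnsigned(1, 28, 5));    // C: true
        System.out.println(fitsUnsigned(1, 69, 7));    // B: true
        System.out.println(fitsUnsigned(0, 9, 4));     // D: true
        System.out.println(fitsSigned(-128, 300, 10)); // U: true
        System.out.println(fitsSigned(-219, 214, 9));  // L: true
        System.out.println(fitsSigned(-128, 300, 9));  // U in 9 bits: false
    }
}
```

The last line shows why U needs one more bit than L: an offset of 300 does not fit in a signed 9-bit field, whose maximum is 255.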
--
Paul Fisher * [EMAIL PROTECTED]