Some of you sent me email complaining about the proposed size of the
Unicode database for java.lang.Character. Complain. Complain.
Complain. :)
So here's the new format, which _should_ satisfy the majority of you.
Kevin Kelley has offered to look into possibly squeezing out a few
more bytes after I get this design implemented (i.e., I need to move
on to other classes and more important things).
$classpath Unicode Database
---------------------------
java.lang.Character allows one to retrieve information on all 65536
characters of the Unicode character set. This is a lot of data. The
database specification outlined here is meant to be fast, small, and
seamlessly upgradable to new versions of the Unicode specification, by
running a script on the data files that the Unicode Consortium
distributes.
The database consists of three files:
1) character.uni (main database of character attributes)
2) block.uni (mappings from each block to its offset in character.uni)
3) titlecase.uni (list of characters where titlecase differs from uppercase)
File sizes for Unicode 2.1.2 spec
---------------------------------
character.uni: 59310 bytes
block.uni:      1836 bytes
titlecase.uni:    16 bytes
All quantities are unsigned unless otherwise specified.
All quantities are stored in big endian format.
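
Java has no unsigned integer types, so reading these fields takes a
little care. A minimal sketch, assuming the file contents have already
been loaded into a byte array; the class and method names are made up
for illustration and are not part of the design:

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for reading unsigned big-endian fields.
class BigEndianFields {
    // Unsigned 8-bit value at byte position pos.
    static int u8(byte[] data, int pos) {
        return data[pos] & 0xFF;
    }

    // Unsigned 16-bit value; ByteBuffer is big endian by default.
    static int u16(byte[] data, int pos) {
        return ByteBuffer.wrap(data).getShort(pos) & 0xFFFF;
    }

    // Unsigned 32-bit value, widened to long because int is signed.
    static long u32(byte[] data, int pos) {
        return ByteBuffer.wrap(data).getInt(pos) & 0xFFFFFFFFL;
    }
}
```

Masking after the read is what turns Java's signed result back into
the unsigned quantity stored in the file.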
character.uni
-------------
Each character defined in the Unicode specification has a 9-byte entry
in the character.uni file. Entries are stored sequentially, ordered by
Unicode character number, and there are no null entries for undefined
characters.
C = Category
B = Subset Block
N = Numerical Decimal Value
(65536 if unused, 65535 if not representable as nonnegative integer value)
D = Decimal Digit Value (unused if (J == 0 && Z == 0))
J = isDigit (Java definition) (1/0)
Z = has a single decomposition for which isDigit == true
(in which case the decomposition's decimal digit value is in DDDD) (1/0)
I = isIdentifierIgnorable (1/0)
U = Uppercase mapping (0 = no mapping)
L = Lowercase mapping (0 = no mapping)
x = Empty
 JZICCCCC   xxxxDDDD   NNNNNNNN   NNNNNNNN   BBBBBBBB   UUUUUUUU   UUUUUUUU
\________/ \________/ \________/ \________/ \________/ \________/ \________/
  byte 8     byte 7     byte 6     byte 5     byte 4     byte 3     byte 2

 LLLLLLLL   LLLLLLLL
\________/ \________/
  byte 1     byte 0
Ranges of values in the Unicode 2.1.2 spec:
C = 1..28 (Sun uses 0..28, and skips 17, so that's what we do too)
B = 1..69
N = 0..10000
D = 0..9
Current database size (6590 characters - 2.1.2 spec) = 59310 bytes.
Maximum database size (65536 characters) = 589824 bytes.
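
To make the bit layout concrete, here is a rough sketch of decoding
one entry in Java. It assumes that "byte 8" in the diagram is the
first byte of the entry as it appears in the file (the diagram read
left to right); the class and field names are invented for the
example.

```java
// Decoded view of one 9-byte character.uni entry (illustrative names).
class CharEntry {
    static final int ENTRY_SIZE = 9;

    final boolean isDigit;              // J
    final boolean decompIsDigit;        // Z
    final boolean identifierIgnorable;  // I
    final int category;                 // C, 5 bits
    final int digitValue;               // D, 4 bits
    final int numericValue;             // N, 16 bits
    final int block;                    // B, 8 bits
    final char upper;                   // U, 0 = no mapping
    final char lower;                   // L, 0 = no mapping

    // Assumption: data[off] holds "byte 8" of the diagram.
    CharEntry(byte[] data, int off) {
        int flags = data[off] & 0xFF;
        isDigit = (flags & 0x80) != 0;
        decompIsDigit = (flags & 0x40) != 0;
        identifierIgnorable = (flags & 0x20) != 0;
        category = flags & 0x1F;
        digitValue = data[off + 1] & 0x0F;   // upper 4 bits are empty
        numericValue = u16(data, off + 2);
        block = data[off + 4] & 0xFF;
        upper = (char) u16(data, off + 5);
        lower = (char) u16(data, off + 7);
    }

    // Unsigned big-endian 16-bit read.
    private static int u16(byte[] d, int i) {
        return ((d[i] & 0xFF) << 8) | (d[i + 1] & 0xFF);
    }
}
```

The n-th entry in the file would then be decoded with something like
new CharEntry(fileBytes, n * CharEntry.ENTRY_SIZE).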
block.uni
---------
Characters within the Unicode specification tend to come in blocks --
sets of sequential characters. The Unicode 2.1.2 specification
contains 306 blocks. The $classpath Unicode database takes advantage
of this property by storing one offset per block rather than one per
character. Each entry in the block.uni file consists of 6 bytes.
Entries are stored sequentially, based on the Unicode character
number of the start of each block.
U = Unicode character which represents start of block
O = Offset of this block within the character.uni file
 UUUUUUUU   UUUUUUUU   OOOOOOOO   OOOOOOOO   OOOOOOOO   OOOOOOOO
\________/ \________/ \________/ \________/ \________/ \________/
  byte 5     byte 4     byte 3     byte 2     byte 1     byte 0
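
Since block.uni is small, it can be read into two parallel arrays
up front. A possible sketch, assuming the whole file is already in a
byte array and that "byte 5" is the first byte of each entry; the
class name BlockTable is only for illustration:

```java
import java.nio.ByteBuffer;

// In-memory copy of block.uni (illustrative, not a fixed design).
class BlockTable {
    static final int ENTRY_SIZE = 6;

    final char[] starts;   // U: first character of each block
    final long[] offsets;  // O: offset of the block in character.uni

    BlockTable(byte[] data) {
        int n = data.length / ENTRY_SIZE;
        starts = new char[n];
        offsets = new long[n];
        ByteBuffer buf = ByteBuffer.wrap(data);  // big endian by default
        for (int i = 0; i < n; i++) {
            starts[i] = buf.getChar();                 // unsigned 16 bits
            offsets[i] = buf.getInt() & 0xFFFFFFFFL;   // unsigned 32 bits
        }
    }
}
```

Keeping the start characters in a sorted char[] makes it easy to
search for the block that contains a given character.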
titlecase.uni
-------------
Characters whose titlecase differs from their uppercase are
stored in titlecase.uni. There are only four characters in the
Unicode 2.1.2 specification which fit this description, and it's
doubtful that any others will ever be added to the specification.
However, we should be able to support more, without changing
java.lang.Character, and this is why we have not hardcoded these
values. Each entry is 4 bytes. Entries are stored sequentially,
based on the Unicode character number.
U = Unicode character which has a titlecase
T = Unicode mapping to titlecase
 UUUUUUUU   UUUUUUUU   TTTTTTTT   TTTTTTTT
\________/ \________/ \________/ \________/
  byte 3     byte 2     byte 1     byte 0
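
One way this table might be used, sketched in Java. The assumptions
here are that the file is already in a byte array, that "byte 3" comes
first in each entry, and that a character with no entry simply takes
its uppercase mapping as its titlecase. TitlecaseTable is a made-up
name.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// In-memory copy of titlecase.uni (illustrative only).
class TitlecaseTable {
    static final int ENTRY_SIZE = 4;

    final char[] chars;   // U: characters with a special titlecase
    final char[] titles;  // T: their titlecase mappings

    TitlecaseTable(byte[] data) {
        int n = data.length / ENTRY_SIZE;
        chars = new char[n];
        titles = new char[n];
        ByteBuffer buf = ByteBuffer.wrap(data);  // big endian by default
        for (int i = 0; i < n; i++) {
            chars[i] = buf.getChar();
            titles[i] = buf.getChar();
        }
    }

    // Entries are sorted by character number, so binary search works.
    // Falls back to the supplied uppercase mapping when c is not listed.
    char toTitleCase(char c, char upper) {
        int i = Arrays.binarySearch(chars, c);
        return i >= 0 ? titles[i] : upper;
    }
}
```

Because the table is data rather than code, adding a fifth character
later only means regenerating the file.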
Finding an entry for character ``U''.
-------------------------------------
* Read in titlecase information beforehand
* Read in the block.uni file beforehand
* Locate the proper block for character U in the block.uni file.
    if U >= current_block && U < next_block then
        found_block
    (for the last block there is no next_block; use the size of
     character.uni as next_block_offset)
    if current_block_offset + (U-current_block)*9 >= next_block_offset
        U is not defined
* Read in U
    U is at current_block_offset + (U-current_block)*9
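
The lookup can be sketched in Java roughly as follows. This is only
one way to do it: it assumes block.uni has already been loaded into
sorted parallel arrays, measures the distance into the block as U
minus the block's start character, and treats the size of
character.uni as the end of the last block. All names are hypothetical.

```java
// Finds where a character's entry lives in character.uni (a sketch).
class EntryLocator {
    static final int ENTRY_SIZE = 9;

    // Returns the byte offset of u's entry, or -1 if u is not defined.
    // starts/offsets are the block.uni columns; charFileSize is the
    // length of character.uni in bytes.
    static long locate(char[] starts, long[] offsets,
                       long charFileSize, char u) {
        // Binary search for the last block whose start is <= u.
        int lo = 0, hi = starts.length - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (starts[mid] <= u) {
                found = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        if (found < 0) {
            return -1;  // u precedes the first block
        }
        long pos = offsets[found] + (long) (u - starts[found]) * ENTRY_SIZE;
        long end = (found + 1 < starts.length)
                ? offsets[found + 1]
                : charFileSize;
        // Running past the block's entries means u falls in a gap.
        return pos < end ? pos : -1;
    }
}
```

A linear scan over the block entries would work just as well; binary
search is used here only because the entries are already sorted.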