Unicode database update

Paul Fisher Sun, 26 Jul 1998 00:30:03 -0400
Some of you sent me email complaining about the proposed size of the
Unicode database for java.lang.Character.  Complain.  Complain.
Complain. :)

So here's the new format, which _should_ satisfy the majority of you.
Kevin Kelley has offered to look into possibly squeezing out a few
more bytes after I get this design implemented (i.e., I need to move
on to other classes and more important things).

$classpath Unicode Database
---------------------------
java.lang.Character allows one to retrieve information on all 65536
characters of the Unicode character set.  This is a lot of data.  The
database specification outlined here is meant to be fast, small, and
seamlessly upgradable to new versions of the Unicode specification, by
running a script on the data files that the Unicode Consortium
distributes.

The database consists of three files:
1) character.uni (main database of character attributes)
2) block.uni (maps each block to its offset in character.uni)
3) titlecase.uni (list of characters where titlecase differs from uppercase)

File sizes for Unicode 2.1.2 spec
---------------------------------
character.uni: 59310 bytes
block.uni    :  1836 bytes
titlecase.uni:    16 bytes

All quantities are unsigned unless otherwise specified.
All quantities are stored in big endian format.

character.uni
-------------
Each character defined in the Unicode specification has a 9-byte
entry in the character.uni file.  Entries are stored sequentially,
ordered by Unicode character number, and there are no null entries
for undefined characters.

C = Category
B = Subset Block
N = Numerical Decimal Value
    (65536 if unused, 65535 if not representable as nonnegative integer value)
D = Decimal Digit Value (unused if (J == 0 && Z == 0))
J = isDigit (Java definition) (1/0)
Z = has a single decomposition for which isDigit is true (1/0)
    (in which case the decomposition's decimal digit value is in DDDD)
I = isIdentifierIgnorable (1/0)
U = Uppercase mapping (0 = no mapping)
L = Lowercase mapping (0 = no mapping)
x = Empty

 JZICCCCC   xxxxDDDD   NNNNNNNN   NNNNNNNN   BBBBBBBB   UUUUUUUU   UUUUUUUU
\________/ \________/ \________/ \________/ \________/ \________/ \________/
  byte 8     byte 7     byte 6     byte 5     byte 4     byte 3     byte 2

 LLLLLLLL   LLLLLLLL
\________/ \________/
  byte 1     byte 0
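
To make the bit layout concrete, here is a minimal Java sketch of
decoding one entry.  It assumes the nine bytes appear in the file in
the order drawn above (byte 8 first, byte 0 last); the class and field
names are hypothetical and not part of the design.

```java
/** One decoded 9-byte record from character.uni (names are illustrative only). */
final class CharacterEntry {
    static final int SIZE = 9;

    final boolean isDigit;              // J
    final boolean hasDigitDecomp;       // Z
    final boolean identifierIgnorable;  // I
    final int category;                 // C, 5 bits
    final int digitValue;               // D, 4 bits (meaningful only if J or Z is set)
    final int numericValue;             // N, 16 bits, unsigned
    final int subsetBlock;              // B, 8 bits, unsigned
    final char upper;                   // U, 0 = no mapping
    final char lower;                   // L, 0 = no mapping

    /** Decodes the entry that starts at data[off]; assumes byte 8 is stored first. */
    CharacterEntry(byte[] data, int off) {
        int flags = data[off] & 0xFF;               // byte 8: JZICCCCC
        isDigit = (flags & 0x80) != 0;
        hasDigitDecomp = (flags & 0x40) != 0;
        identifierIgnorable = (flags & 0x20) != 0;
        category = flags & 0x1F;
        digitValue = data[off + 1] & 0x0F;          // byte 7: xxxxDDDD
        numericValue = u16(data, off + 2);          // bytes 6-5
        subsetBlock = data[off + 4] & 0xFF;         // byte 4
        upper = (char) u16(data, off + 5);          // bytes 3-2
        lower = (char) u16(data, off + 7);          // bytes 1-0
    }

    /** Reads an unsigned big-endian 16-bit quantity. */
    static int u16(byte[] d, int off) {
        return ((d[off] & 0xFF) << 8) | (d[off + 1] & 0xFF);
    }
}
```

The masks with 0xFF are needed because Java bytes are signed, while
every quantity in the database is unsigned.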


Ranges of values in the Unicode 2.1.2 spec

C = 1..28 (Sun uses 0..28, and skips 17, so that's what we do too)
B = 1..69
N = 0..10000
D = 0..9

Current database size (6590 characters - 2.1.2 spec) = 59310 bytes.
Maximum database size (65536 characters) = 589824 bytes.

block.uni
---------
Characters within the Unicode specification tend to come in blocks --
sets of sequential characters.  The Unicode 2.1.2 specification
contains 306 blocks.  The $classpath Unicode database takes advantage
of this property.  Each entry in the block.uni file consists of 6
bytes.  Entries are stored sequentially, based on the Unicode
character number.

U = Unicode character which represents start of block
O = Offset of this block within the character.uni file

 UUUUUUUU   UUUUUUUU   OOOOOOOO   OOOOOOOO   OOOOOOOO   OOOOOOOO
\________/ \________/ \________/ \________/ \________/ \________/
  byte 5     byte 4     byte 3     byte 2     byte 1     byte 0
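
As a rough illustration, the whole block.uni file could be loaded into
two parallel arrays like this.  The sketch assumes each entry is stored
in the order drawn above (start character first, then the offset); the
class and field names are hypothetical.

```java
import java.nio.ByteBuffer;

/** In-memory copy of block.uni (names are illustrative only). */
final class BlockTable {
    static final int ENTRY_SIZE = 6;

    final char[] start;   // U: first character of each block
    final int[] offset;   // O: byte offset of that block in character.uni

    BlockTable(byte[] data) {
        int n = data.length / ENTRY_SIZE;
        start = new char[n];
        offset = new int[n];
        // ByteBuffer reads big endian by default, matching the file format.
        ByteBuffer buf = ByteBuffer.wrap(data);
        for (int i = 0; i < n; i++) {
            start[i] = buf.getChar();   // 2 bytes, unsigned
            // 4 bytes; a Java int is signed, but character.uni is at most
            // 589824 bytes, so the offset always fits.
            offset[i] = buf.getInt();
        }
    }
}
```

At 306 blocks this costs two small arrays, so keeping the table
resident is cheap compared with the character data itself.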

titlecase.uni
-------------
Characters in which the titlecase differs from the uppercase are
stored in titlecase.uni.  There are only four characters in the
Unicode 2.1.2 specification which fit this description, and it's
doubtful that any others will ever be added to the specification.
However, we should be able to support more, without changing
java.lang.Character, and this is why we have not hardcoded these
values.  Each entry is 4 bytes.  Entries are stored sequentially,
based on the Unicode character number.

U = Unicode character which has a titlecase
T = Unicode mapping to titlecase

 UUUUUUUU   UUUUUUUU   TTTTTTTT   TTTTTTTT
\________/ \________/ \________/ \________/
  byte 3     byte 2     byte 1     byte 0
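
A possible way to use this file, sketched in Java: load the pairs once,
then fall back to the uppercase mapping for any character that is not
listed.  The class and method names are hypothetical, and the sketch
assumes entries are stored in the order drawn above and sorted by
character number as described.

```java
import java.util.Arrays;

/** In-memory copy of titlecase.uni (names are illustrative only). */
final class TitlecaseTable {
    private final char[] keys;    // U: characters with a distinct titlecase
    private final char[] titles;  // T: their titlecase mappings

    TitlecaseTable(byte[] data) {
        int n = data.length / 4;
        keys = new char[n];
        titles = new char[n];
        for (int i = 0; i < n; i++) {
            keys[i] = u16(data, 4 * i);
            titles[i] = u16(data, 4 * i + 2);
        }
    }

    /** Returns the titlecase of c, or the given uppercase mapping if c is not listed. */
    char toTitleCase(char c, char upper) {
        // Valid because the file is sorted by Unicode character number.
        int i = Arrays.binarySearch(keys, c);
        return i >= 0 ? titles[i] : upper;
    }

    /** Reads an unsigned big-endian 16-bit quantity. */
    private static char u16(byte[] d, int off) {
        return (char) (((d[off] & 0xFF) << 8) | (d[off + 1] & 0xFF));
    }
}
```

Because the table is read from the file rather than hardcoded, adding
characters later only requires regenerating titlecase.uni.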


Finding an entry for character ``U''.
-------------------------------------
* Read in titlecase information beforehand
* Read in the block.uni file beforehand
* Locate the proper block for character U in the block.uni file.
    if U >= current_block && U < next_block then
      found_block
    if current_block_offset + (U-current_block)*9 >= next_block_offset
      U is not defined
* Read in U
    U is at current_block_offset + (U-current_block)*9
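
The lookup steps above could be sketched in Java as follows.  This is
only one way to do it: the binary search, the class and method names,
and the use of the character.uni file length as the end of the last
block are all assumptions, not part of the design.

```java
import java.util.Arrays;

/** Finds where a character's entry lives in character.uni (illustrative only). */
final class EntryLocator {
    static final int ENTRY_SIZE = 9;

    /**
     * blockStart and blockOffset hold the U and O fields of block.uni, in file order.
     * Returns the byte offset of u's entry in character.uni, or -1 if u is not defined.
     */
    static int locate(char[] blockStart, int[] blockOffset, int charFileLength, char u) {
        // Find the last block whose start character is <= u.
        int i = Arrays.binarySearch(blockStart, u);
        if (i < 0) {
            i = -i - 2;   // binarySearch returns -(insertion point) - 1
        }
        if (i < 0) {
            return -1;    // u lies before the first block
        }
        int pos = blockOffset[i] + (u - blockStart[i]) * ENTRY_SIZE;
        // A block ends where the next one begins; the last block ends at end of file.
        int end = (i + 1 < blockOffset.length) ? blockOffset[i + 1] : charFileLength;
        return pos < end ? pos : -1;
    }
}
```

For example, with a block starting at character 0x20 at offset 0, the
entry for 'A' (0x41) is at (0x41 - 0x20) * 9 = 297.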