It's great that you got it down to that size! I guess the savings come
from not all of the 65536 possible characters being assigned in the
Unicode spec?
--John Keiser
> -----Original Message-----
> From: Paul Fisher [mailto:[EMAIL PROTECTED]]On Behalf Of Paul Fisher
> Sent: Saturday, July 25, 1998 9:18 PM
> To: [EMAIL PROTECTED]
> Subject: Unicode database update
>
>
> Some of you sent me email complaining about the proposed size of the
> Unicode database for java.lang.Character. Complain. Complain.
> Complain. :)
>
> So here's the new format, which _should_ satisfy the majority of you.
> Kevin Kelley has offered to look into possibly squeezing out a few
> more bytes after I get this design implemented (i.e., I need to move
> onto other classes and more important things).
>
> $classpath Unicode Database
> ---------------------------
> java.lang.Character allows one to retrieve information on all 65536
> characters of the Unicode character set. This is a lot of data. The
> database specification outlined here is meant to be fast, small, and
> seamlessly upgradable to new versions of the Unicode specification, by
> running a script on the data files that the Unicode Consortium
> distributes.
>
> The database consists of three files:
> 1) character.uni (main database of character attributes)
> 2) block.uni (mappings from each block to offset in char file)
> 3) titlecase.uni (list of characters where titlecase differs from
> uppercase)
>
> File sizes for Unicode 2.1.2 spec
> ---------------------------------
> character.uni: 59310 bytes
> block.uni : 1836 bytes
> titlecase.uni: 16 bytes
>
> All quantities are unsigned unless otherwise specified.
> All quantities are stored in big endian format.
>
> character.uni
> -------------
> Each character in the Unicode specification has an entry in the
> character.uni file. Each entry consists of 9 bytes. Entries are
> stored sequentially, ordered by Unicode character number, and there
> are no null entries.
>
> C = Category
> B = Subset Block
> N = Numerical Decimal Value
> (65536 if unused, 65535 if not representable as nonnegative
> integer value)
> D = Decimal Digit Value (unused if (J == 0 && Z == 0))
> J = isDigit (Java definition) (1/0)
> Z = has a single decomposition for which isDigit is true (1/0)
> (in which case the decomposition's decimal digit value is in DDDD)
> I = isIdentifierIgnorable (1/0)
> U = Uppercase mapping (0 = no mapping)
> L = Lowercase mapping (0 = no mapping)
> x = Empty
>
>  JZICCCCC   xxxxDDDD   NNNNNNNN   NNNNNNNN   BBBBBBBB
> \________/ \________/ \________/ \________/ \________/
>   byte 8     byte 7     byte 6     byte 5     byte 4
>
>  UUUUUUUU   UUUUUUUU   LLLLLLLL   LLLLLLLL
> \________/ \________/ \________/ \________/
>   byte 3     byte 2     byte 1     byte 0
>
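The entry layout above could be decoded roughly as follows. This is only a sketch, not the actual $classpath code: the class name `CharEntry` and its field names are made up for illustration, and it assumes that "byte 8" is the first byte of each entry in the file (which is how I read "big endian" here).

```java
import java.nio.ByteBuffer;

/** Hypothetical decoder for one 9-byte character.uni entry. */
final class CharEntry {
    static final int ENTRY_SIZE = 9;

    final boolean isDigit;             // J
    final boolean hasDigitDecomp;      // Z
    final boolean identifierIgnorable; // I
    final int category;                // C (5 bits)
    final int digitValue;              // D (4 bits)
    final int numericValue;            // N (16 bits, unsigned)
    final int block;                   // B (8 bits, unsigned)
    final char upper;                  // U (0 = no mapping)
    final char lower;                  // L (0 = no mapping)

    CharEntry(byte[] data, int offset) {
        // ByteBuffer reads big endian by default, matching the spec.
        ByteBuffer buf = ByteBuffer.wrap(data, offset, ENTRY_SIZE);
        int flags = buf.get() & 0xFF;            // byte 8: JZICCCCC
        isDigit = (flags & 0x80) != 0;
        hasDigitDecomp = (flags & 0x40) != 0;
        identifierIgnorable = (flags & 0x20) != 0;
        category = flags & 0x1F;
        digitValue = buf.get() & 0x0F;           // byte 7: xxxxDDDD
        numericValue = buf.getShort() & 0xFFFF;  // bytes 6..5
        block = buf.get() & 0xFF;                // byte 4
        upper = buf.getChar();                   // bytes 3..2
        lower = buf.getChar();                   // bytes 1..0
    }

    public static void main(String[] args) {
        // Made-up sample entry, not taken from the real data file.
        byte[] sample = {(byte) 0x89, 0x05, 0x00, 0x05, 0x01, 0, 0, 0, 0};
        CharEntry e = new CharEntry(sample, 0);
        System.out.println("category=" + e.category + " digit=" + e.digitValue);
    }
}
```

The masks with 0xFF and 0xFFFF are there because Java's byte and short are signed, while the spec says all quantities are unsigned.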
>
> ranges of values in Unicode 2.1.2 spec
>
> C = 1..28 (Sun uses 0..28, and skips 17, so that's what we do too)
> B = 1..69
> N = 0..10000
> D = 0..9
>
> Current database size (6590 characters - 2.1.2 spec) = 59310 bytes.
> Maximum database size (65536 characters) = 589824 bytes.
>
> block.uni
> ---------
> Characters within the Unicode specification tend to come in blocks --
> sets of sequential characters. The Unicode 2.1.2 specification
> contains 306 blocks. The $classpath Unicode database takes advantage
> of this property. Each entry in the block.uni file consists of 6
> bytes. Entries are stored sequentially, based on the Unicode
> character number.
>
> U = Unicode character which represents start of block
> O = Offset of this block within the character.uni file
>
>  UUUUUUUU   UUUUUUUU   OOOOOOOO   OOOOOOOO   OOOOOOOO   OOOOOOOO
> \________/ \________/ \________/ \________/ \________/ \________/
>   byte 5     byte 4     byte 3     byte 2     byte 1     byte 0
>
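Reading block.uni into memory might look something like this. Again just a sketch under assumptions: `BlockTable` and its fields are hypothetical names, and I'm assuming bytes 5..4 (the start character) come first in each entry, followed by the 4-byte offset.

```java
import java.nio.ByteBuffer;

/** Hypothetical in-memory form of block.uni (6 bytes per entry). */
final class BlockTable {
    static final int ENTRY_SIZE = 6;

    final char[] start;   // U: first character of each block
    final long[] offset;  // O: byte offset of the block in character.uni

    BlockTable(byte[] data) {
        int n = data.length / ENTRY_SIZE;
        start = new char[n];
        offset = new long[n];
        ByteBuffer buf = ByteBuffer.wrap(data); // big endian by default
        for (int i = 0; i < n; i++) {
            start[i] = buf.getChar();               // bytes 5..4
            offset[i] = buf.getInt() & 0xFFFFFFFFL; // bytes 3..0, unsigned
        }
    }

    public static void main(String[] args) {
        // Two made-up blocks: U+0020 at offset 0, U+00A0 at offset 855.
        byte[] sample = {0x00, 0x20, 0, 0, 0, 0,
                         0x00, (byte) 0xA0, 0, 0, 0x03, 0x57};
        BlockTable t = new BlockTable(sample);
        System.out.println(t.start.length + " blocks loaded");
    }
}
```

Keeping the two columns in parallel arrays (rather than an array of objects) keeps the table small, and a sorted `char[]` can be searched directly.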
> titlecase.uni
> -------------
> Characters in which the titlecase differs from the uppercase are
> stored in titlecase.uni. There are only four characters in the
> Unicode 2.1.2 specification which fit this description, and it's
> doubtful that any others will ever be added to the specification.
> However, we should be able to support more, without changing
> java.lang.Character, and this is why we have not hardcoded these
> values. Each entry is 4 bytes. Entries are stored sequentially,
> based on the Unicode character number.
>
> U = Unicode character which has a titlecase
> T = Unicode mapping to titlecase
>
>  UUUUUUUU   UUUUUUUU   TTTTTTTT   TTTTTTTT
> \________/ \________/ \________/ \________/
>   byte 3     byte 2     byte 1     byte 0
>
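For titlecase.uni, a small map is probably enough. The sketch below uses a hypothetical `TitlecaseTable` class and assumes that a character missing from the table simply falls back to its uppercase mapping (that is my reading of "titlecase differs from uppercase", not something the spec states outright). The sample bytes are for illustration and are not the contents of the real file.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical loader for titlecase.uni (4 bytes per entry). */
final class TitlecaseTable {
    static final int ENTRY_SIZE = 4;

    /** Parses the whole file into a map from character to titlecase. */
    static Map<Character, Character> parse(byte[] data) {
        Map<Character, Character> map = new HashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(data); // big endian by default
        while (buf.remaining() >= ENTRY_SIZE) {
            char u = buf.getChar(); // bytes 3..2
            char t = buf.getChar(); // bytes 1..0
            map.put(u, t);
        }
        return map;
    }

    /** Assumed fallback: use the uppercase mapping if no entry exists. */
    static char toTitleCase(Map<Character, Character> map, char ch, char upper) {
        Character t = map.get(ch);
        return t != null ? t : upper;
    }

    public static void main(String[] args) {
        // Illustrative entry: U+01C6 maps to titlecase U+01C5.
        byte[] sample = {0x01, (byte) 0xC6, 0x01, (byte) 0xC5};
        Map<Character, Character> map = parse(sample);
        System.out.println(Integer.toHexString(toTitleCase(map, '\u01C6', '\u01C4')));
    }
}
```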
>
> Finding an entry for character ``U''.
> -------------------------------------
> * Read in titlecase information beforehand
> * Read in the block.uni file beforehand
> * Locate the proper block for character U in the block.uni file.
>     if U >= current_block && U < next_block then
>         found_block
>     if current_block_offset + (U-current_block)*9 >= next_block_offset
>         U is not defined
> * Read in U
>     U is at current_block_offset + (U-current_block)*9
>
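The lookup steps could be sketched like this. Note that I've swapped in a binary search over the sorted block starts instead of the linear scan described above; the result should be the same. `BlockLookup`, `find`, and the `fileLength` parameter (the size of character.uni, used as the end of the last block) are names I made up for the example.

```java
import java.util.Arrays;

/** Hypothetical lookup of a character's entry offset in character.uni. */
final class BlockLookup {
    static final int ENTRY_SIZE = 9;

    /**
     * Returns the byte offset of u's entry, or -1 if u is not defined.
     * start[] must be sorted; offset[i] belongs to the block at start[i].
     */
    static long find(char[] start, long[] offset, long fileLength, char u) {
        int i = Arrays.binarySearch(start, u);
        if (i < 0) {
            i = -i - 2;        // index of the last block start below u
        }
        if (i < 0) {
            return -1;         // u comes before the first block
        }
        long pos = offset[i] + (long) (u - start[i]) * ENTRY_SIZE;
        long end = (i + 1 < start.length) ? offset[i + 1] : fileLength;
        return pos < end ? pos : -1; // past the block's data: undefined
    }

    public static void main(String[] args) {
        // Made-up table: U+0020..U+007E (95 chars), U+00A0..U+00FF (96 chars).
        char[] start = {0x20, 0xA0};
        long[] offset = {0, 855};
        long fileLength = 855 + 96 * ENTRY_SIZE;
        System.out.println(find(start, offset, fileLength, 'A'));
        System.out.println(find(start, offset, fileLength, '\u007F'));
    }
}
```

With the made-up table in `main`, 'A' lands inside the first block, while U+007F falls in the gap between the two blocks and is reported as undefined.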
>