RE: Unicode database update

John Keiser Sun, 26 Jul 1998 11:20:03 -0400
It's great that you got it down to that size!  I guess it relates to not all
of the possible characters being used in the Unicode spec?
--John Keiser

> -----Original Message-----
> From: Paul Fisher [mailto:[EMAIL PROTECTED]]On Behalf Of Paul Fisher
> Sent: Saturday, July 25, 1998 9:18 PM
> To: [EMAIL PROTECTED]
> Subject: Unicode database update
>
>
> Some of you sent me email complaining about the proposed size of the
> Unicode database for java.lang.Character.  Complain.  Complain.
> Complain. :)
>
> So here's the new format, which _should_ satisfy the majority of you.
> Kevin Kelley has offered to look into possibly squeezing out a few
> more bytes after I get this design implemented (i.e., I need to move
> on to other classes and more important things).
>
> $classpath Unicode Database
> ---------------------------
> java.lang.Character allows one to retrieve information on all 65536
> characters of the Unicode character set.  This is a lot of data.  The
> database specification outlined here is meant to be fast, small, and
> seamlessly upgradable to new versions of the Unicode specification, by
> running a script on the data files that the Unicode Consortium
> distributes.
>
> The database consists of three files:
> 1) character.uni (main database of character attributes)
> 2) block.uni (mappings from each block to offset in char file)
> 3) titlecase.uni (list of characters where titlecase differs from
>    uppercase)
>
> File sizes for Unicode 2.1.2 spec
> ---------------------------------
> character.uni: 59310 bytes
> block.uni    :  1836 bytes
> titlecase.uni:    16 bytes
>
> All quantities are unsigned unless otherwise specified.
> All quantities are stored in big endian format.
>
> character.uni
> -------------
> Each character in the Unicode specification has an entry in the
> character.uni file.  Characters are stored sequentially, and there are
> no null entries.  Each entry consists of 9 bytes.  Entries are stored
> sequentially, based on the Unicode character number.
>
> C = Category
> B = Subset Block
> N = Numerical Decimal Value
>     (65536 if unused, 65535 if not representable as a nonnegative
>     integer value)
> D = Decimal Digit Value (unused if (J == 0 && Z == 0))
> J = isDigit (Java definition) (1/0)
> Z = has a single decomposition for which isDigit is true
>     (in which case the decomposition's decimal digit value is stored
>     in DDDD) (1/0)
> I = isIdentifierIgnorable (1/0)
> U = Uppercase mapping (0 = no mapping)
> L = Lowercase mapping (0 = no mapping)
> x = Empty
>
>  JZICCCCC   xxxxDDDD   NNNNNNNN   NNNNNNNN   BBBBBBBB   UUUUUUUU   UUUUUUUU
> \________/ \________/ \________/ \________/ \________/ \________/ \________/
>   byte 8     byte 7     byte 6     byte 5     byte 4     byte 3     byte 2
>
>  LLLLLLLL   LLLLLLLL
> \________/ \________/
>   byte 1     byte 0
>
>
> ranges of values in Unicode 2.1.2 spec
>
> C = 1..28 (Sun uses 0..28, and skips 17, so that's what we do too)
> B = 1..69
> N = 0..10000
> D = 0..9
>
> Current database size (6590 characters - 2.1.2 spec) = 59310 bytes.
> Maximum database size (65536 characters) = 589824 bytes.
>
> block.uni
> ---------
> Characters within the Unicode specification tend to come in blocks --
> sets of sequential characters.  The Unicode 2.1.2 specification
> contains 306 blocks.  The $classpath Unicode database takes advantage
> of this property.  Each entry in the block.uni file consists of 6
> bytes.  Entries are stored sequentially, based on the Unicode
> character number.
>
> U = Unicode character which represents start of block
> O = Offset of this block within the character.uni file
>
>  UUUUUUUU   UUUUUUUU   OOOOOOOO   OOOOOOOO   OOOOOOOO   OOOOOOOO
> \________/ \________/ \________/ \________/ \________/ \________/
>   byte 5     byte 4     byte 3     byte 2     byte 1     byte 0
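
As a quick illustration of the 6-byte entry, the block table could be read along these lines in Python. This is a sketch under the assumption that the 2-byte start character comes first on disk, followed by the 4-byte offset; parse_block_table is a hypothetical helper name, not anything from $classpath.

```python
import struct
from typing import List, Tuple

# Assumption: 16-bit block start, then 32-bit offset, both big endian.
BLOCK_ENTRY = struct.Struct(">HI")


def parse_block_table(data: bytes) -> List[Tuple[int, int]]:
    """Return (start_char, offset) pairs in file order."""
    if len(data) % BLOCK_ENTRY.size != 0:
        raise ValueError("block.uni length is not a multiple of 6 bytes")
    return list(BLOCK_ENTRY.iter_unpack(data))
```

With 306 blocks this gives 306 * 6 = 1836 bytes, which matches the file size quoted earlier.
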
>
> titlecase.uni
> -------------
> Characters in which the titlecase differs from the uppercase are
> stored in titlecase.uni.  There are only four characters in the
> Unicode 2.1.2 specification which fit this description, and it's
> doubtful that any others will ever be added to the specification.
> However, we should be able to support more, without changing
> java.lang.Character, and this is why we have not hardcoded these
> values.  Each entry is 4 bytes.  Entries are stored sequentially,
> based on the Unicode character number.
>
> U = Unicode character which has a titlecase
> T = Unicode mapping to titlecase
>
>  UUUUUUUU   UUUUUUUU   TTTTTTTT   TTTTTTTT
> \________/ \________/ \________/ \________/
>   byte 3     byte 2     byte 1     byte 0
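
The titlecase table is small enough to load into a map up front. A minimal Python sketch, assuming the character comes first and its titlecase mapping second within each 4-byte entry (parse_titlecase_table is an invented name for this example):

```python
import struct
from typing import Dict

# Assumption: 16-bit character, then 16-bit titlecase mapping, big endian.
TITLECASE_ENTRY = struct.Struct(">HH")


def parse_titlecase_table(data: bytes) -> Dict[int, int]:
    """Map each character to its titlecase form."""
    if len(data) % TITLECASE_ENTRY.size != 0:
        raise ValueError("titlecase.uni length is not a multiple of 4 bytes")
    return {char: title for char, title in TITLECASE_ENTRY.iter_unpack(data)}
```

A character that is missing from the map would then fall back to its uppercase mapping.
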
>
>
> Finding an entry for character ``U''.
> -------------------------------------
> * Read in titlecase information beforehand
> * Read in the block.uni file beforehand
> * Locate the proper block for character U in the block.uni file.
>     if U >= current_block && U < next_block then
>       found_block
>     if current_block_offset + (U-current_block)*9 >= next_block_offset
>       U is not defined
> * Read in U
>     U is at current_block_offset + (U-current_block)*9
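
The lookup steps above can be sketched in Python as follows. This is one possible reading rather than the actual java.lang.Character code: it uses a binary search over the block starts, and it assumes the last block is bounded by the size of character.uni, which the description does not spell out. record_offset and its parameters are hypothetical names.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

RECORD_SIZE = 9  # bytes per character.uni entry


def record_offset(
    code_point: int,
    blocks: List[Tuple[int, int]],
    char_file_size: int,
) -> Optional[int]:
    """Return the byte offset of code_point's entry, or None if undefined.

    blocks holds (start_char, offset) pairs sorted by start_char.
    """
    starts = [start for start, _ in blocks]
    # Index of the last block whose start is <= code_point.
    i = bisect_right(starts, code_point) - 1
    if i < 0:
        return None  # before the first block
    start, offset = blocks[i]
    # A block ends where the next one begins in character.uni
    # (assumption: the final block ends at the end of the file).
    end_offset = blocks[i + 1][1] if i + 1 < len(blocks) else char_file_size
    position = offset + (code_point - start) * RECORD_SIZE
    return position if position < end_offset else None
```

For a table with a block at 0x0020 holding 95 characters, 'A' (0x0041) lands at (0x41 - 0x20) * 9 = 297, while 0x007F falls past the end of that block and is reported as undefined. Rebuilding the list of starts on every call is wasteful; a real implementation would keep it around.
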
>
>