A revised, simplified file format proposal based on my original
sketch, some of Markus's ideas for "NCF", and an evaluation of which
optimizations were likely to benefit actual font data.

Definitions:

All numeric fields are variable length coded, using the high bit of
each byte as a continuation bit. This is something like UTF-8 except
more efficient since we don't need synchronization or ASCII-compat
properties.

Many fields are such that large values would be nonsense anyway, but
everything is vlc just for consistency. I have marked the fields which
should not be more than 1 byte as such.

All offsets are displacements relative to the byte immediately
following the vlc in which the offset is stored. Thus an offset of 0
means the data being pointed to immediately follows the vlc.

File header:
- magic number:     8 bytes
- version:          vlc
- cell width:       vlc (1 byte)
- cell height:      vlc (1 byte)
- tree depth:       vlc (1 byte)
- tree offset:      vlc
- context count:    vlc
- context spacing:  vlc (1 byte)
- offsets to individual context structures, padded with junk if
  necessary so that successive offset vlcs are spaced context_spacing
  bytes apart from one another.

Tree nodes:
- pivot:            vlc
- right offset:     vlc

The left child of each node begins immediately following the last
field of the node structure. The right child is indexed by the right
offset vlc. Pivot is the first codepoint that falls on the right
child, expressed relative to the parent node's pivot element (which is
implicitly 0 at the root).

After following the full depth of the tree as specified in the header
(which may be 0, i.e. degenerate), the resulting position contains a
block. Blocks represent (hopefully) contiguous ranges of (hopefully)
uniform-width characters. The block structure itself has exactly one
field, a vlc "stride" indicating the difference between successive
character record positions in the block. The first character of the
block begins immediately after the stride vlc. In particular, if
stride is 0, all characters in the block are represented by a single
record. This is particularly useful for representing ranges of
unassigned code points or characters not covered by the font.

A character record begins with an opcode vlc. The following values are
special:
- 0: a glyph to be used immediately follows the vlc
- 1: a vlc offset immediately follows the vlc. if the offset is 0 then
  no glyph is available for the character. otherwise process the
  character record at the specified offset.
- 2-3: reserved
- 4-...: context code from the context table, followed by a vlc
  offset. if the character's context matches the context code, then
  the glyph to be used is located at the specified offset. otherwise,
  process the character record immediately following the offset vlc.

In all cases, glyph data is just the raw scanline bits.


In general, a well-formed file looks like:

1. Header

2. Context tables

3. Binary tree with 4-8 levels to avoid the need for a huge (million
   entry) table of character record pointers, weed out empty
   codepoints and group double-width characters together.

4. Within each tree leaf, either fixed-size records for each character
   or a list of pointers to the actual character records. The program
   writing the file can select which works best for its purposes.

5. Character records which select a glyph based on the table of
   contexts. These can be essentially degenerate for characters
   represented by exactly one glyph.

6. Glyphs can be stored alongside the last of the character records
   they service (which helps save some space on offsets) or at the end
   of the file, or anywhere else you can manage to put them.

Nice properties:

Computational overhead is very low. Finding a glyph to match a
character and context requires one conditional and "depth" memory
accesses to reach the block head, a small amout of arithmetic to reach
the character record, and then whatever conditionals the context rules
require (just one for ligature-free characters). Cost rises somewhat
if you want to tolerate corrupt font files gracefully but remains
inexpensive.

Memory overhead is low. In addition, while not required it is
certainly possible and encouraged that related glyphs be kept close
together. This will promote efficient demand-paging so that scripts
not in use do not consume memory. The format was designed so that it
can be parsed on demand while mmapped, without the need to predecode
and cache large tables in process-local memory. Embedded systems may
use a compressed filesystem or emulate something similar in userspace.

Assuming a source format of the form "collection of glyphs with each
glyph listing which characters it can represent and in what contexts
it can do so with priorities assigned to each context to resolve
conflicts", the above mentioned binary format can be transformed back
and forth to and from the source format without loss of important
information. In particular, the notion that a single glyph is
representing several characters need not be lost. Other attributes
such as a descriptive name for the glyph or comments may of course be
lost however.

Expected size for a complete BMP font at 8x16 is 1.7-2 megs. Omitting
Traditional Chinese could reduce this (how much? what portion of the
CJK characters are Traditional-only?) and leave most users happy.
Omitting CJK entirely leaves the expected size around 200-250k.

Known Limitations:

Glyph width is stored nowhere in the font file. This is intentional
because the font does not get to decide on character width; it must
match wcwidth(). Yet it does require the user to have a working
wcwidth implementation supporting any character that will be used. In
the absence of such a routine, the font file is still parsable and
valid but display will be incorrect. One possible "fix" for this is to
use separate left-half and right-half "glyphs" for double-width
characters and define contexts of "left half" and "rigth half". This
approach may more accurately reflect the needs of applications anyway.
It's also possible that full-width versions of ordinary narrow
combining marks may be needed for use on CJK characters...? Can anyone
comment on this?

Variable-width fonts are not supported. Honestly I am not interested
in them since variable-width bitmap fonts have no practical modern
applications; printing and pretty layout like a web browser might due
both require high-quality scalable fonts. I do have some ideas for how
variable-width fonts could be supported if anyone really cares for
them.

References to common glyphs can only be forward references, not back
references. If this is a problem a signed vlc could be used for the
glyph offset. The chaining "opcode 1" is intentionally unsigned
however to make infinite loops impossible so that an implementation
does not have to waste complexity trying to detect them. It's always
possible to store the glyphs such that forward references are
sufficient, but for the sake of optimizing the number of pages mapped
into ram, it may be desirable to locate a shared glyph with the more
common range it is used in. However it's unlikely that an automated
font compiler would know how to do this anyway.

Context format for ligature rules has not been specified, pending
expert advice on the matter. :)

Rich





--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/

Reply via email to