A revised, simplified file format proposal based on my original sketch, some of Markus's ideas for "NCF", and an evaluation of which optimizations were likely to benefit actual font data.
Definitions: All numeric fields are variable length coded, using the high bit of each byte as a continuation bit. This is something like UTF-8 except more efficient since we don't need synchronization or ASCII-compat properties. Many fields are such that large values would be nonsense anyway, but everything is vlc just for consistency. I have marked the fields which should not be more than 1 byte as such. All offsets are displacements relative to the byte immediately following the vlc in which the offset is stored. Thus an offset of 0 means the data being pointed to immediately follows the vlc. File header: - magic number: 8 bytes - version: vlc - cell width: vlc (1 byte) - cell height: vlc (1 byte) - tree depth: vlc (1 byte) - tree offset: vlc - context count: vlc - context spacing: vlc (1 byte) - offsets to individual context structures, padded with junk if necessary so that successive offset vlcs are spaced context_spacing bytes apart from one another. Tree nodes: - pivot: vlc - right offset: vlc The left child of each node begins immediately following the last field of the node structure. The right child is indexed by the right offset vlc. Pivot is the first codepoint that falls on the right child, expressed relative to the parent node's pivot element (which is implicitly 0 at the root). After following the full depth of the tree as specified in the header (which may be 0, i.e. degenerate), the resulting position contains a block. Blocks represent (hopefully) contiguous ranges of (hopefully) uniform-width characters. The block structure itself has exactly one field, a vlc "stride" indicating the difference between successive character record positions in the block. The first character of the block begins immediately after the stride vlc. In particular, if stride is 0, all characters in the block are represented by a single record. This is particularly useful for representing ranges of unassigned code points or characters not covered by the font. A character record begins with an opcode vlc. The following values are special: - 0: a glyph to be used immediately follows the vlc - 1: a vlc offset immediately follows the vlc. if the offset is 0 then no glyph is available for the character. otherwise process the character record at the specified offset. - 2-3: reserved - 4-...: context code from the context table, followed by a vlc offset. if the character's context matches the context code, then the glyph to be used is located at the specified offset. otherwise, process the character record immediately following the offset vlc. In all cases, glyph data is just the raw scanline bits. In general, a well-formed file looks like: 1. Header 2. Context tables 3. Binary tree with 4-8 levels to avoid the need for a huge (million entry) table of character record pointers, weed out empty codepoints and group double-width characters together. 4. Within each tree leaf, either fixed-size records for each character or a list of pointers to the actual character records. The program writing the file can select which works best for its purposes. 5. Character records which select a glyph based on the table of contexts. These can be essentially degenerate for characters represented by exactly one glyph. 6. Glyphs can be stored alongside the last of the character records they service (which helps save some space on offsets) or at the end of the file, or anywhere else you can manage to put them. Nice properties: Computational overhead is very low. Finding a glyph to match a character and context requires one conditional and "depth" memory accesses to reach the block head, a small amout of arithmetic to reach the character record, and then whatever conditionals the context rules require (just one for ligature-free characters). Cost rises somewhat if you want to tolerate corrupt font files gracefully but remains inexpensive. Memory overhead is low. In addition, while not required it is certainly possible and encouraged that related glyphs be kept close together. This will promote efficient demand-paging so that scripts not in use do not consume memory. The format was designed so that it can be parsed on demand while mmapped, without the need to predecode and cache large tables in process-local memory. Embedded systems may use a compressed filesystem or emulate something similar in userspace. Assuming a source format of the form "collection of glyphs with each glyph listing which characters it can represent and in what contexts it can do so with priorities assigned to each context to resolve conflicts", the above mentioned binary format can be transformed back and forth to and from the source format without loss of important information. In particular, the notion that a single glyph is representing several characters need not be lost. Other attributes such as a descriptive name for the glyph or comments may of course be lost however. Expected size for a complete BMP font at 8x16 is 1.7-2 megs. Omitting Traditional Chinese could reduce this (how much? what portion of the CJK characters are Traditional-only?) and leave most users happy. Omitting CJK entirely leaves the expected size around 200-250k. Known Limitations: Glyph width is stored nowhere in the font file. This is intentional because the font does not get to decide on character width; it must match wcwidth(). Yet it does require the user to have a working wcwidth implementation supporting any character that will be used. In the absence of such a routine, the font file is still parsable and valid but display will be incorrect. One possible "fix" for this is to use separate left-half and right-half "glyphs" for double-width characters and define contexts of "left half" and "rigth half". This approach may more accurately reflect the needs of applications anyway. It's also possible that full-width versions of ordinary narrow combining marks may be needed for use on CJK characters...? Can anyone comment on this? Variable-width fonts are not supported. Honestly I am not interested in them since variable-width bitmap fonts have no practical modern applications; printing and pretty layout like a web browser might due both require high-quality scalable fonts. I do have some ideas for how variable-width fonts could be supported if anyone really cares for them. References to common glyphs can only be forward references, not back references. If this is a problem a signed vlc could be used for the glyph offset. The chaining "opcode 1" is intentionally unsigned however to make infinite loops impossible so that an implementation does not have to waste complexity trying to detect them. It's always possible to store the glyphs such that forward references are sufficient, but for the sake of optimizing the number of pages mapped into ram, it may be desirable to locate a shared glyph with the more common range it is used in. However it's unlikely that an automated font compiler would know how to do this anyway. Context format for ligature rules has not been specified, pending expert advice on the matter. :) Rich -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/
