To follow up on my original proposal, and on the alterations and
simplifications I've made as a result of the discussions here and with
people outside this list, here's a summary of the problem I'm trying
to solve and how I plan to solve it:

Practical problems:
- no terminal emulators with broad support for scripts
- existing font formats put the burden on software for performing
  layout and parts of the substitution, and have very high size
  overhead
- existing formats don't support more than 64k glyphs or characters
  (and if they do it'll be through some hacks...)

Ivory tower problems:
- existing font formats don't respect the Unicode distinction between
  characters and glyphs and use hacks to work around this problem

Requirements:
- low overhead -- should average 4 or fewer bytes per character
- complete separation of the notions of character and glyph
- oriented towards character-cell devices
- ability to select the correct glyph when doing character-at-a-time
  rendering, without passing large strings to a substitution library
  or having to guess how much context is needed.
- "efficient access path to these glyphs" (-- Kuhn '99)
- derived from source format which consists of glyphs only along with
  lists of which characters each glyph can represent and under what
  conditions (Kuhn's idea).
- reversibility: must be feasible to recover source format from a
  "compiled" font file.

Implementation:
- completely interpreter based: glyph selector code interprets a
  program in the font file to map characters to glyphs
- interpreted language is intentionally extremely weak; it does not
  admit any unbounded constructs.
- interpreted language is designed to be used efficiently by a font
  compiler starting from the source format with typical real-world
  font data.

"Variables" in interpreted language:
- character number, initially set to the desired character
- glyph number, initially set to zero

Operations in interpreted language (all args are unsigned):
- 0. end program, using current glyph number
- 1. if (ch>=arg1) { jump by arg2 code bytes; ch=0; }
- 2. glyph += ch*arg; ch=0;
- 3. jump by ch*arg code bytes; ch=0;
- 4. jump by arg bytes
- 5. if (in context specified by arg1) { glyph += arg2; end; }
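
For illustration, here is a rough sketch in Python of how an
interpreter for the six operations above might look. Several things in
it are assumptions and not part of the draft: the program is a flat
list of unsigned integers with one cell per opcode or argument
(standing in for whatever byte encoding the real format ends up
using), jumps are counted from the start of the following instruction,
and the names select_glyph, in_context and ASCII_ONLY are made up.
Op 1 is implemented literally as written, so ch is reset to 0 when the
jump is taken.

```python
# Sketch only. Assumptions not in the draft: the program is a list of
# unsigned ints, one cell per opcode or argument; jumps are relative to
# the start of the next instruction; in_context is a hypothetical
# callback that decides whether context number `ctx` applies.

def select_glyph(program, ch, in_context=lambda ctx: False):
    """Interpret `program` for character number `ch`; return a glyph number."""
    glyph = 0
    pc = 0
    while True:
        op = program[pc]
        if op == 0:      # end program, using current glyph number
            return glyph
        elif op == 1:    # if (ch>=arg1) { jump by arg2; ch=0; }
            arg1, arg2 = program[pc + 1], program[pc + 2]
            pc += 3
            if ch >= arg1:
                pc += arg2
                ch = 0
        elif op == 2:    # glyph += ch*arg; ch=0;
            glyph += ch * program[pc + 1]
            ch = 0
            pc += 2
        elif op == 3:    # jump by ch*arg; ch=0;
            pc += 2 + ch * program[pc + 1]
            ch = 0
        elif op == 4:    # jump by arg
            pc += 2 + program[pc + 1]
        elif op == 5:    # if (in context arg1) { glyph += arg2; end; }
            arg1, arg2 = program[pc + 1], program[pc + 2]
            if in_context(arg1):
                return glyph + arg2
            pc += 3
        else:
            raise ValueError(f"unknown opcode {op} at offset {pc}")


# Hypothetical toy font program: characters below 0x80 map directly to
# the glyph with the same number; everything else ends with glyph 0.
ASCII_ONLY = [
    1, 0x80, 2,   # op 1: if ch >= 0x80, skip the next 2 cells
    2, 1,         # op 2: glyph += ch * 1
    0,            # op 0: end
]

print(select_glyph(ASCII_ONLY, 0x41))    # 65
print(select_glyph(ASCII_ONLY, 0x4E2D))  # 0
```

Under these assumptions every jump moves forward, because all
arguments are unsigned, so the loop cannot run forever: it either
reaches an end op or runs off the end of the program.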

Usage: 0 is obvious. 1 allows conditional treatment of ranges (best
use is to construct a binary tree with it), especially huge unassigned
or unsupported character ranges. 2 allows entire ranges without
ligatures or variants to map directly to glyphs without per-character
cpu-time or file-size overhead. 3 allows a sort of jump table where
the font can specify a code vector for each character in the range.
4 is self-explanatory.

Finally, 5 is the key feature. While the former ops are there to allow
efficient mapping of one million (or more..?) codepoints to glyphs, 5
allows conditional glyph selection based on context. Contexts are
essentially RE bracket expressions (much weaker than RE) for the
adjacent character positions surrounding the character whose glyph has
been requested. For example, the 'low accent mark' context for Latin
might be [acegijmnopqrsuvwxyz] immediately prior to the accent mark.
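
As a sketch of what such a context test could look like (the
representation here is purely hypothetical, since the exact context
format is still undecided): a context is a mapping from a relative
position to the set of characters allowed at that position, with -1
meaning the character immediately before the one being rendered. The
names LOW_ACCENT and in_context are invented for this example.

```python
# Hypothetical representation: relative position -> allowed characters.
# -1 is the character immediately before the one whose glyph is wanted.
LOW_ACCENT = {-1: set("acegijmnopqrsuvwxyz")}


def in_context(context, text, index):
    """True if every listed neighbor of text[index] is in its allowed set."""
    for offset, allowed in context.items():
        pos = index + offset
        # A neighbor that falls outside the text never matches.
        if pos < 0 or pos >= len(text):
            return False
        if text[pos] not in allowed:
            return False
    return True


# U+0301 is COMBINING ACUTE ACCENT.
print(in_context(LOW_ACCENT, "e\u0301", 1))  # True: 'e' is in the set
print(in_context(LOW_ACCENT, "E\u0301", 1))  # False: 'E' is not
```

Only the characters at the listed positions are examined, which fits
the goal of character-at-a-time rendering without having to guess how
much context is needed.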

The precise requirements for contexts are one of the details I'm still
working out, and I would like help with them since I'm _not_ familiar
with every script on the planet. Of course, if I just go with the
draft spec and refine it along the way while building my font (with
large parts derived from the GNU Unifont project, but corrected for
the horrible character==glyph assumption it makes and its lack of
correct nonspacing/wide glyphs), by the time it's done I'll probably
have something that works very well.

A few more details: whenever you want to look up a glyph for a
character, you begin at the interpreted program's entry point and
interpret it. Typically this process will begin with one or more type
1 operations to eliminate unused portions of the codepoint space, then
use type 2 (if the entire range maps directly to glyphs) or 3 (to
implement individual processing for each character in the range).
How efficient the lookup is depends on having a good compiler to
convert the source file to such a 'program', but due to reversibility,
a poorly compiled font can be restored to source and recompiled with a
better compiler.

Rich


P.S. I'm going to be travelling in Taiwan for the next week and a half
and not working on this project during that time. Please don't think
I've disappeared or dropped it. I'll be back (with more rigorous specs
and maybe some completed code) by the end of the month.


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
