Re: Decomposing composed glyphs

Lawrence D'Oliveiro Fri, 06 Jun 2025 17:04:10 -0700

On Fri, 6 Jun 2025 16:58:14 +0200, Cassie D wrote:

> I have noticed in font editor FontForge that most fonts usually build
> some glyphs by referring to other glyphs like
> 
>    - Ä (U+00C4) by referring the two base glyphs diaeresis (either
> U+00A8 or U+0308, with one usually referring the other) and the basic
> latin letter A
>    - ç (U+00E7) refers to cedilla (U+00B8 or 0x0327) and letter c.
>    - greek and cyrillic letters also usually refer to the equivalent
> latin letter when they are identical.
>    - I've also seen a font, I don't remember which one, in which the
> German ß (U+00DF) refers to the letter S twice (as it is the
> equivalent of a double S).


There are three different concepts here that need to be kept apart, or
the discussion gets completely lost: “characters”, “code points” and
“glyphs”. There is a Unicode document
<https://www.unicode.org/faq/char_combmark.html> which uses the term
“grapheme” in place of “character”, because of the many confusing
connotations which have become attached to the latter. A grapheme is
defined as “a minimally distinctive unit of writing in the context of a
particular writing system”. The entities that are listed in the Unicode
tables are called “code points”.

So your first two examples involve graphemes which have alternative
representations (composed versus decomposed) in terms of Unicode code
points. Many common graphemes/characters have both forms, whereas less
common ones can only be written as a decomposed sequence: a base letter
followed by one or more combining marks.
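
To make the composed/decomposed distinction concrete, here is a minimal
sketch using Python's standard unicodedata module. The helper name
same_grapheme and the "q with diaeresis" example are my own choices for
illustration, not anything taken from FontForge or from the FAQ.

```python
import unicodedata

composed = "\u00C4"      # Ä as one precomposed code point
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Different code-point sequences, so a plain comparison fails.
plain_equal = composed == decomposed

# Hypothetical helper: normalize both sides to NFC before comparing.
def same_grapheme(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Assumption for illustration: q + COMBINING DIAERESIS has no precomposed
# code point, so NFC leaves it as two code points.
rare = unicodedata.normalize("NFC", "q\u0308")
```

Here plain_equal is False while same_grapheme(composed, decomposed) is
True, and rare still contains two code points.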

Glyphs, on the other hand, are geometric shapes that come from the font
design. The correspondence between glyphs and characters/graphemes can
be quite loose: the OpenType spec allows for a great deal of
flexibility. For example, ligatures can have their own glyphs that can
be automatically substituted during text rendering, rather than having
to have their own encoding in the text. Unicode includes some common
ligatures purely for historical reasons; it usually makes text
entry/editing/searching easier to avoid these.
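
As a rough illustration of why encoded ligatures get in the way of
searching, the following sketch (Python, standard unicodedata module)
uses U+FB01, one of the legacy ligature code points. The sample word is
just an example I picked.

```python
import unicodedata

word = "\uFB01sh"   # LATIN SMALL LIGATURE FI followed by "sh"

# A naive substring search misses the ligature-encoded text.
naive_hit = "fish" in word

# Canonical normalization (NFC) keeps the ligature code point ...
nfc = unicodedata.normalize("NFC", word)
# ... while compatibility normalization (NFKC) expands it to "f" + "i".
nfkc = unicodedata.normalize("NFKC", word)
```

So naive_hit is False, nfc is unchanged, and only nfkc equals "fish".
Text that stores plain "f" + "i" and leaves the ligature to the font
avoids the problem in the first place.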

As far as text rendering is concerned, alternative representations of
the same grapheme should always be rendered the same way: a capital-A
with umlaut is still the same capital-A with umlaut, regardless of
whether the composed or decomposed code-point sequence is used.

When letters from different alphabets look the same, this often has to
do with the history of how they originated, though sometimes it might
be a coincidence. There may indeed be multilingual fonts that assign
the same glyph to the different characters, but nevertheless they are
still distinct characters.
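
A small sketch of the "distinct characters" point, again with Python's
unicodedata module. The three letters are my own choice of lookalikes;
whether a given font shares one glyph among them is a separate question
that this code does not touch.

```python
import unicodedata

# Three capitals that many fonts draw identically.
lookalikes = ["A", "\u0391", "\u0410"]

# Each one is a separate code point with its own Unicode name.
names = [unicodedata.name(ch) for ch in lookalikes]

# Even compatibility normalization does not merge them.
normalized = {unicodedata.normalize("NFKC", ch) for ch in lookalikes}
```

The names come out as LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER
ALPHA and CYRILLIC CAPITAL LETTER A, and normalized still has three
members.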

The German double-S has its own Unicode code point, but I’m not sure how
Germans regard this character: is it considered a separate
character/grapheme? Or just an alternative ligature representation of
“ss”?
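
For what it is worth, the Unicode data can be queried on this, though
that only shows how Unicode treats U+00DF, not how German readers think
of it. A quick sketch with Python's built-in string methods and the
unicodedata module:

```python
import unicodedata

sharp_s = "\u00DF"  # ß

name = unicodedata.name(sharp_s)
# An empty string here means no decomposition mapping is defined.
decomposition = unicodedata.decomposition(sharp_s)
# Normalization therefore leaves it alone, even in compatibility form.
nfkd = unicodedata.normalize("NFKD", sharp_s)

# Case operations, on the other hand, do relate it to a double S.
upper = sharp_s.upper()
folded = sharp_s.casefold()
```

The name is LATIN SMALL LETTER SHARP S, decomposition is empty and nfkd
is unchanged, yet upper is "SS" and folded is "ss". So it is encoded as
a letter in its own right, with the link to "ss" living in the case
mappings rather than in any decomposition.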

There’s probably more to be said, but I think that gives you an idea ...
