On Fri, 6 Jun 2025 16:58:14 +0200, Cassie D wrote:

> I have noticed in font editor FontForge that most fonts usually build
> some glyphs by referring to other glyphs like
>
> - Ä (U+00C4) by referring the two base glyphs diaeresis (either
> U+00A8 or U+0308, with one usually referring the other) and the basic
> latin letter A
> - ç (U+00E7) refers to cedilla (U+00B8 or 0x0327) and letter c.
> - greek and cyrillic letters also usually refer to the equivalent
> latin letter when they are identical.
> - I've also seen a font, I don't remember which one, in which the
> German ß (U+00DF) refers to the letter S twice (as it is the
> equivalent of a double S).

There are three different concepts here that need to be clarified, to keep the discussion from getting completely lost: “characters”, “code points” and “glyphs”.

There is a Unicode document <https://www.unicode.org/faq/char_combmark.html> which uses the term “grapheme” in place of “character” because of the many confusing connotations which have become attached to the latter: it is defined as “a minimally distinctive unit of writing in the context of a particular writing system”. The entities that are listed in Unicode tables are called “code points”.

So your first two examples involve graphemes which have alternative representations (composed versus decomposed) in terms of Unicode code points. This happens with many common graphemes/characters, whereas with less common ones, only the decomposed forms are available.

Glyphs, on the other hand, are geometric shapes that come from the font design. The correspondence between glyphs and characters/graphemes can be quite loose: the OpenType spec allows for a great deal of flexibility. For example, ligatures can have their own glyphs that can be automatically substituted during text rendering, rather than having to have their own encoding in the text. Unicode includes some common ligatures purely for historical reasons; it usually makes text entry/editing/searching easier to avoid these.

As far as text rendering is concerned, alternative representations of the same grapheme should always be rendered the same way: a capital-A with umlaut is still the same capital-A with umlaut, regardless of whether the composed or decomposed code-point sequence is used.

Letters from multiple alphabets looking the same may have to do with the history of how they originated, though sometimes this might be a historical coincidence. There may indeed be multilingual fonts that assign the same glyph to the different characters, but nevertheless they are still distinct characters.

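To make the code-point side of this concrete, here is a small Python sketch using only the standard unicodedata module. It is my own illustration, not anything taken from FontForge or from a particular font, and the variable names are made up for the example. It shows that the composed and decomposed forms of Ä are different code-point sequences that normalization maps onto each other, that lookalike capitals from Latin, Greek and Cyrillic stay distinct code points, and that the encoded "fi" ligature only folds back to plain letters under compatibility normalization.

```python
import unicodedata

# Two encodings of the same grapheme (capital A with diaeresis).
composed = "\u00C4"       # single precomposed code point
decomposed = "A\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# As raw code-point sequences they differ ...
print(composed == decomposed)
# ... but they are canonically equivalent: NFC composes, NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Lookalike capitals from three scripts: a font may draw them with one
# glyph, but they remain three separate code points.
lookalikes = ["\u0041", "\u0391", "\u0410"]
names = [unicodedata.name(ch) for ch in lookalikes]
for ch, name in zip(lookalikes, names):
    print(f"U+{ord(ch):04X} {name}")

# An encoded ligature (U+FB01) is left alone by canonical normalization
# and only becomes "f" + "i" under compatibility normalization (NFKC).
ligature = "\uFB01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature, nfkc_ligature == "fi")
```

None of this says anything about which glyphs a font uses; it only shows what the text itself contains before a renderer ever looks at the font.
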
The German double-S has its own Unicode code point, but I’m not sure how Germans look on this character; is it considered a separate character/grapheme? Or just an alternative ligature representation of “ss”?

There’s probably more to be said, but I think that gives you an idea ...
