Re: [HarfBuzz] Ligatures

2020-05-22 Thread Richard Wordingham
On Fri, 22 May 2020 22:32:04 +0300
Eli Zaretskii wrote:

> Can someone please tell what are the recommended practices regarding
> these ligatures?  Is the set of possible ligatures indeed infinite and
> impossible to know in advance?  And does HarfBuzz have APIs to query a
> font about the ligatures it supports?

hb_ot_layout_get_ligature_carets() is liable to be garbage in, garbage
out.  While caret positions were included in OTL fonts to assist
cursor placement, the scheme obviously fails when the components are
stacked vertically.  Microsoft gave up on it and, if I remember the
informal statement correctly, just divides the glyph up evenly between
the characters or grapheme clusters.  Many OpenType fonts don't
populate the relevant section of the GDEF table.  And, of course, one
has real trouble when one glyph can come from different numbers of
components.
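
A minimal sketch of the fallback described above, in Python: use the
font's caret positions only when their count matches the number of
components actually present, and otherwise divide the glyph's advance
evenly.  The function name and the shape of its arguments are
hypothetical; this is not HarfBuzz's API, and a real caller would get
the font carets (if any) from hb_ot_layout_get_ligature_carets().

```python
def caret_positions(advance, n_components, font_carets=None):
    """Return the x offsets of the carets inside one ligature glyph.

    advance      -- glyph advance in font units
    n_components -- characters (or grapheme clusters) the glyph came from
    font_carets  -- caret offsets from the font, or None (hypothetical input)
    """
    if n_components < 1:
        raise ValueError("a glyph must come from at least one component")
    # Trust the font only if it supplies exactly one caret per boundary
    # between components; the same glyph may come from different numbers
    # of components, in which case the font data does not apply.
    if font_carets is not None and len(font_carets) == n_components - 1:
        return list(font_carets)
    # Fallback: divide the advance evenly between the components.
    return [advance * i / n_components for i in range(1, n_components)]


if __name__ == "__main__":
    print(caret_positions(1500, 3))              # even division
    print(caret_positions(1500, 3, [450, 980]))  # font data used as given
    print(caret_positions(1500, 2, [450, 980]))  # count mismatch: even division
```

Even division is wrong for vertically stacked components too, but it
never depends on data the font may not have.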

LibreOffice takes (or took) a different approach, and uses the width of
the characters logically before the insertion point.  It's rather
disconcerting when the cursor jumps backwards as one steps through the
string.  It could happen with the Latin script string "a͡i", for the
'double' inverted breve should shorten when the second letter is 'i'.
One can get the effect in Indic scripts because of spacing viramas.
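
The backwards jump can be illustrated with a toy model, sketched here
in Python.  toy_width() is a hypothetical stand-in for "shape this
prefix and sum the advances"; the advances and the overhang of the
double inverted breve are invented numbers, not taken from any font.

```python
TIE = "\u0361"  # COMBINING DOUBLE INVERTED BREVE

# Invented advances in font units, for illustration only.
LETTER_ADVANCE = {"a": 500, "i": 250}
TIE_OVERHANG = 400  # assumed extra width when nothing follows the tie yet


def toy_width(text):
    """Toy stand-in for a shaper: total advance of `text`."""
    width = sum(LETTER_ADVANCE[ch] for ch in text if ch != TIE)
    # With no second letter yet, the tie sticks out to the right; once a
    # narrow letter follows, the tie is assumed to shorten and fit over it.
    if text.endswith(TIE):
        width += TIE_OVERHANG
    return width


def caret_offsets(text, measure):
    """Caret offset at each insertion point = width of the text before it."""
    return [measure(text[:i]) for i in range(len(text) + 1)]


if __name__ == "__main__":
    offsets = caret_offsets("a" + TIE + "i", toy_width)
    print(offsets)  # the last offset is smaller than the one before it
```

With these numbers the caret after the tie sits to the right of the
caret at the end of the string, which is the jump described above.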

Richard.
___
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/harfbuzz


Re: [HarfBuzz] Ligatures

2020-05-22 Thread Richard Wordingham
On Fri, 22 May 2020 22:32:04 +0300
Eli Zaretskii wrote:

> Hi,
> 
> This is a bit off-topic, but I thought it could be appropriate to ask
> here, since we have here some of the best experts on this subject.
> 
> We are discussing support for ligatures in Emacs, specifically when
> using HarfBuzz as the shaping engine.  See the discussion from
> 
>   https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html
> 
> The current support for producing ligatures works in the same way as
> complex text shaping for scripts that require that, like Arabic and
> Khmer: the sequences of characters that can be displayed as ligatures
> are identified in advance with suitable regular expressions, and the
> display engine then passes these sequences to hb_shape to produce the
> ligatures.
> 
> This works well for scripts that require complex shaping, because such
> scripts generally have well-defined rules for the sequences of
> codepoints that need shaping.

They may of course have more than one set of such rules, with the rule
sets defining different sets of sequences.

> My original thoughts were that
> ligatures could be supported in the same way, based on the assumption
> that the list of possible ligatures is finite and can be stored in a
> suitable data structure in advance.

At one level, this is true for any individual font, for it cannot have
more than 65,536 glyphs.

> However, I'm being told that this assumption is false, and that each
> font defines ligatures from any number of arbitrary combinations of
> characters, and therefore the exhaustive list of the ligatures is in
> practice infinite and cannot be provided in advance.

This arbitrariness is true.  Over the set of all credible fonts for a
given character repertoire, the number of ligating combinations is
unbounded.

> The only way of
> doing this right, I'm told, is to either (a) query the font to get the
> list of all the ligatures it supports, or (b) assume any combination
> of characters can produce a ligature, and therefore we need to pass
> all the characters intended for display through hb_shape.  The latter
> in particular is in stark contrast to how the current Emacs display
> code is designed and implemented.

> To be specific, I'm talking about 2 kinds of ligatures:
> 
>   . ligatures made of Latin characters, like "ffi" and "Th"
>   . ligatures produced from symbols, like "==>" that is
>     converted into ⟹
> 
> Can someone please tell what are the recommended practices regarding
> these ligatures?  Is the set of possible ligatures indeed infinite and
> impossible to know in advance?  And does HarfBuzz have APIs to query a
> font about the ligatures it supports?

Have you addressed the cursive scripts yet, such as Arabic?  At its
simplest, most consonants have four shapes, initial, medial, final and
isolated, and roughly speaking the shape used depends on the adjacent
spacing characters.  For the most part, Emacs would have to pass whole
words into HarfBuzz for shaping.  In some of the more advanced fonts,
the vowel marks in a word may also affect the shape of the consonant
skeleton.  And of course, sometimes the Arabic script prefers to join
letters vertically, as well as having a few straightforward ligatures.
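
To show why the shape depends on the neighbours, here is a simplified
Python sketch of choosing among the four forms.  The joining types for
the few letters listed follow Unicode's ArabicShaping data; everything
else is a simplification: marks are not skipped, unlisted characters
are treated as non-joining, and the function name is hypothetical.  It
is not how HarfBuzz implements Arabic shaping.

```python
# Dual-joining letters join on both sides; right-joining letters join
# only to the preceding letter.
DUAL = set("\u0628\u062A\u0643\u0644\u0645")   # beh, teh, kaf, lam, meem
RIGHT = set("\u0627\u062F\u0631\u0648")        # alef, dal, reh, waw


def joining_forms(word):
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # A letter joins backwards if the previous letter joins forwards.
        joins_prev = prev in DUAL and (ch in DUAL or ch in RIGHT)
        # Only dual-joining letters join forwards.
        joins_next = ch in DUAL and (nxt in DUAL or nxt in RIGHT)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


if __name__ == "__main__":
    # kaf, teh, alef, beh
    print(joining_forms("\u0643\u062A\u0627\u0628"))
    # The same letter (beh) alone, then doubled
    print(joining_forms("\u0628"), joining_forms("\u0628\u0628"))
```

No letter's form can be decided without looking at both neighbours,
which is why the shaper needs at least the whole word.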

A cursive Latin script font may behave in the same way, with the shape
of letters depending on what precedes and follows them.  With a small
enough character repertoire, there might be no ligatures, but your
rendering logic would fail miserably.

How would you handle the possibility that all three of <æ>,  and
 might be rendered by the same glyph, although they are
comprised of 1, 2 and 3 characters respectively?  And if Emacs is not
imposing a normalisation, then all the precomposed characters in
Unicode might have been entered as one or as more than one character? 
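
The normalisation point can be checked with Python's standard
unicodedata module.  This sketch only shows that one precomposed
character can have several equivalent spellings of different lengths;
the helper names are hypothetical, and it says nothing about what a
given font does with each spelling.

```python
import unicodedata


def decomposed_length(ch):
    """Number of code points in the canonical decomposition (NFD) of `ch`."""
    return len(unicodedata.normalize("NFD", ch))


def same_text(a, b):
    """True if `a` and `b` are canonically equivalent."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    e_acute = "\u00E9"         # LATIN SMALL LETTER E WITH ACUTE
    e_circ_dot = "\u1EC7"      # LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
    ae = "\u00E6"              # LATIN SMALL LETTER AE

    print(decomposed_length(e_acute))     # one character, or base + mark
    print(decomposed_length(e_circ_dot))  # one character, or base + two marks
    print(same_text(e_acute, "e\u0301"))  # both spellings are the same text
    # U+00E6 has no decomposition, so normalisation does not relate it
    # to the two-letter sequence.
    print(same_text(ae, "ae"))
```

Without normalisation, a match on code point sequences would treat the
equivalent spellings as different text.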

Richard.


[HarfBuzz] Ligatures

2020-05-22 Thread Eli Zaretskii
Hi,

This is a bit off-topic, but I thought it could be appropriate to ask
here, since we have here some of the best experts on this subject.

We are discussing support for ligatures in Emacs, specifically when
using HarfBuzz as the shaping engine.  See the discussion from

  https://lists.gnu.org/archive/html/emacs-devel/2020-05/msg02493.html

The current support for producing ligatures works in the same way as
complex text shaping for scripts that require that, like Arabic and
Khmer: the sequences of characters that can be displayed as ligatures
are identified in advance with suitable regular expressions, and the
display engine then passes these sequences to hb_shape to produce the
ligatures.
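
For readers unfamiliar with the approach, the segmentation step can be
sketched in Python as follows.  The pattern list and the function name
are hypothetical and this is not Emacs's actual code; only the runs
flagged True would be handed to the shaper.

```python
import re

# Hypothetical, fixed list of sequences known in advance to ligate.
LIGATURE_RE = re.compile(r"==>|ffi|Th")


def split_runs(text):
    """Split `text` into (substring, needs_shaping) runs."""
    runs = []
    pos = 0
    for m in LIGATURE_RE.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], False))  # displayed as-is
        runs.append((m.group(), True))                 # passed to the shaper
        pos = m.end()
    if pos < len(text):
        runs.append((text[pos:], False))
    return runs


if __name__ == "__main__":
    print(split_runs("x ==> office"))
    # A sequence missing from the list is never shaped as a unit,
    # whatever the font supports.
    print(split_runs("a -> b"))
```

The second call shows the weakness under discussion: the result is
only as good as the list of patterns.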

This works well for scripts that require complex shaping, because such
scripts generally have well-defined rules for the sequences of
codepoints that need shaping.  My original thoughts were that
ligatures could be supported in the same way, based on the assumption
that the list of possible ligatures is finite and can be stored in a
suitable data structure in advance.

However, I'm being told that this assumption is false, and that each
font defines ligatures from any number of arbitrary combinations of
characters, and therefore the exhaustive list of the ligatures is in
practice infinite and cannot be provided in advance.  The only way of
doing this right, I'm told, is to either (a) query the font to get the
list of all the ligatures it supports, or (b) assume any combination
of characters can produce a ligature, and therefore we need to pass
all the characters intended for display through hb_shape.  The latter
in particular is in stark contrast to how the current Emacs display
code is designed and implemented.

To be specific, I'm talking about 2 kinds of ligatures:

  . ligatures made of Latin characters, like "ffi" and "Th"
  . ligatures produced from symbols, like "==>" that is
    converted into ⟹

Can someone please tell what are the recommended practices regarding
these ligatures?  Is the set of possible ligatures indeed infinite and
impossible to know in advance?  And does HarfBuzz have APIs to query a
font about the ligatures it supports?

Thanks in advance for any help.