On 24/1/14 19:26, [email protected] wrote:
Hi, I'm the maintainer of the Jieubsida fonts. Dohyun Kim kindly drew my
attention to the recent discussion on this list of changes to HarfBuzz's
hangul support and how it relates to these fonts, and I wanted to make some
comments and ask some questions. This is a lengthy message, but I'm trying
to be very specific about the details, because those are important.
Hi Matthew - Thanks for your message, and for working through the
details so carefully. I'll try to respond and clarify where I can...
These fonts are intended to be able to typeset the full range of hangul
defined in Unicode - including both the precomposed syllable code points and
the (basic and extended) individual jamo. So I want to be able to
typeset all these code point sequences, and typeset them identically, using
a single glyph that is a precomposed syllable:
1. U+1100 U+1161 U+11B7 (choseong-kiyeok jungseong-a jongseong-mieum)
2. U+AC00 U+11B7 (syllable-ga jongseong-mieum)
3. U+AC10 (syllable-gam)
I'm not an expert on Unicode canonical equivalence, but I believe these
three sequences are canonically equivalent to each other under the rules
in sections 3.7 and 3.12 of the current Unicode standard
(http://www.unicode.org/versions/Unicode6.2.0/ch03.pdf). Sequence 1 is the
canonical decomposition of all three. If I'm reading the discussion of the
last few days correctly, it sounds like we're all more or less in agreement
on that.
Yes. These are defined to be canonically equivalent (now and forever, as
Unicode stability policies prohibit any change), and therefore I believe
it is appropriate for all three to be rendered identically.
(Incidentally, IIRC the "semi-composed" version (2) above does not
currently work in Windows/Uniscribe. I consider that a defect, and am
glad to note that harfbuzz does handle it correctly.)
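For anyone who wants to check, a minimal sketch using Python's standard unicodedata module shows all three sequences collapsing to the same normalized forms:
    import unicodedata
    sequences = [
        "\u1100\u1161\u11B7",   # (1) choseong kiyeok + jungseong a + jongseong mieum
        "\uAC00\u11B7",         # (2) syllable GA + jongseong mieum
        "\uAC10",               # (3) syllable GAM
    ]
    print({unicodedata.normalize("NFD", s) for s in sequences})   # one element: U+1100 U+1161 U+11B7
    print({unicodedata.normalize("NFC", s) for s in sequences})   # one element: U+AC10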
I would also like to be able to typeset the extended compound jamo as nicely
as possible. For instance, I would like these two sequences to both be
typeset with a single glyph that is a precomposed lead jamo cluster, to be
overlaid with additional glyphs for subsequent code points that would
describe the vowel and tail of the syllable:
4. U+1107 U+1109 U+1110 (choseong-pieup choseong-sios choseong-thieuth)
5. U+A972 (choseong-pieup-sios-thieuth)
Exactly which glyph is used for these two sequences should be
context-sensitive, determined by the following vowel and presence or absence
of a tail. It looks to me like these may not be canonically equivalent
under Unicode; U+A972 does not canonically decompose, and I don't think
there is such a thing as canonical composition of jamo. Nonetheless it
certainly appears that they should be understood as the same text,
describing the same fragment of a syllable.
This is a trickier area. As you note, these two sequences are *not*
equivalent from a Unicode point of view, even though they "obviously"
(to a human) describe the same fragment of text.
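(Easy to confirm with Python's unicodedata module: U+A972 has no canonical decomposition, so neither NFC nor NFD ever relates the two spellings. A minimal check:)
    import unicodedata
    print(unicodedata.decomposition("\uA972"))    # '' - U+A972 has no canonical decomposition
    spelled  = "\u1107\u1109\u1110"               # (4) choseong pieup + sios + thieuth
    compound = "\uA972"                           # (5) choseong pieup-sios-thieuth
    for form in ("NFC", "NFD"):
        print(form, unicodedata.normalize(form, spelled) == unicodedata.normalize(form, compound))
        # prints NFC False and NFD False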
On Mon Jan 20, Jonathan Kew writes:
Is this actually important? Note that Windows behaves similarly, and so
data that has "spelled-out" representations of complex jamos won't work
there either. AIUI, the recommended practice is to use the precomposed
Unicode characters such as U+A972 directly - and because these do *not*
have decompositions, mixing the two forms will lead to confusion and
problems for users. Perhaps it's better that the non-preferred spelling
does not render "correctly".
Even if it's rare or discouraged for anyone to attempt to typeset sequences
like number 4 above, and even if Windows is broken, I would prefer that such
sequences should render correctly with my fonts and HarfBuzz.
If the sequences (4) and (5) were canonically equivalent, I would of
course agree wholeheartedly with this (it would be in the same category
as (1)-(3) above). However, in Unicode terms they are not equivalent;
moreover, the relevant Korean standard, at least, makes it clear that
(5) is to be regarded as correct, and (4) should not be used.
Because (4) and (5) are not canonically equivalent, they will not
*function* as equivalents in general-purpose Unicode-based software,
even when that software is careful to respect Unicode rules for
equivalence (e.g. by normalizing text prior to operations such as
search, indexing, etc.). Searching a document for a syllable that
contains U+A972 will fail to find that "same" syllable if it was spelled
using U+1107 U+1109 U+1110.
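To make that concrete, here is a minimal Python sketch of such a search (again using the standard unicodedata module):
    import unicodedata
    document = "\u1107\u1109\u1110\u1161"   # syllable fragment spelled out as in (4)
    query    = "\uA972\u1161"               # the "same" fragment spelled with (5)
    norm = lambda s: unicodedata.normalize("NFC", s)
    print(norm(query) in norm(document))    # False - normalization never folds (4) and (5) together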
And because these sequences are not equivalent, and will not be folded
together during normalization or other Unicode-aware operations, I think
we're actually doing users a *disservice* and hurting the reusability of
data if we force them to display the same. This will mislead users into
expecting interoperable behavior that will not actually work.
The way the Jieubsida fonts are currently intended to work is that after the
cmap table translates code points into a stream of glyphs, that glyph stream
goes through the ccmp, ljmo, vjmo, and liga tables, in that order.
In ccmp, the glyphs representing precomposed syllables like U+AC00 and
U+AC10 are split into their component jamo, and the glyphs representing
individual jamo are joined into glyphs representing clusters, where
possible. Note that these tables of course operate on glyphs, not code
points - which becomes important later, when there are multiple glyphs for
the same nominal code point. Although this isn't a deliberate design
feature, I think this table's effect is very similar to Unicode
canonicalization. After this table, my code point sequences 1, 2, and 3
should all translate to the glyph sequence "uni1100 uni1161 uni11B7", and
sequences 4 and 5 to the single glyph "uniA972".
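(The syllable-splitting part of my ccmp rules just encodes the standard arithmetic from section 3.12 of the Unicode standard, applied to glyph names in the "uniXXXX" convention used above; roughly, in Python terms:)
    SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    VCOUNT, TCOUNT = 21, 28
    def decompose_syllable(s):
        # Split a precomposed syllable code point into L, V (and optional T) jamo.
        index = s - SBASE
        l = LBASE + index // (VCOUNT * TCOUNT)
        v = VBASE + (index % (VCOUNT * TCOUNT)) // TCOUNT
        t = index % TCOUNT
        jamo = [l, v] + ([TBASE + t] if t else [])
        return ["uni%04X" % cp for cp in jamo]
    print(decompose_syllable(0xAC00))   # ['uni1100', 'uni1161']
    print(decompose_syllable(0xAC10))   # ['uni1100', 'uni1161', 'uni11B7']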
In the ljmo table, glyphs for lead (choseong) jamo are substituted depending
on the shape of the vowel (jungseong) and whether there is a tail
(jongseong) jamo. In the case of "uni1100 uni1161 uni11B7", the vowel is in
the "vertical" class and there is a tail, so the table selects the "layout
1" variant and the glyph sequence becomes "uni1100.l1 uni1161 uni11B7".
In the vjmo table, glyphs for the vowel may be substituted similarly. In
the particular case of "uni1100.l1 uni1161 uni11B7", the default glyph for
U+1161 is correct for layout 1 and so there's no change. If there were no
tail, it would choose a different layout including a substitution for
uni1161.
Finally, in the liga table, any sequences for which precomposed glyphs exist
are replaced by the precomposed glyphs. Since there is a "uniAC10" glyph
corresponding to the sequence "uni1100.l1 uni1161 uni11B7", it will be used.
At this point all three of my sequences 1, 2, and 3 are typeset the way I
want them and that's great.
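(As a toy illustration of how the chain fits together, here is a rough Python model of the tables acting on glyph names; the vowel class and the precomposed set shown are tiny illustrative stand-ins for the real tables, not their actual contents, and the ".l1" naming follows the variant names used above:)
    VERTICAL_VOWELS = {"uni1161", "uni1163", "uni1165", "uni1167", "uni1175"}     # illustrative subset
    PRECOMPOSED = {("uni1100.l1", "uni1161", "uni11B7"): "uniAC10"}               # illustrative subset
    def ljmo(glyphs):
        # Choose a lead-jamo variant from the vowel shape and the presence of a tail.
        lead, vowel, rest = glyphs[0], glyphs[1], glyphs[2:]
        if vowel in VERTICAL_VOWELS and rest:
            lead += ".l1"      # "layout 1": vertical vowel with a tail (other layouts elided)
        return [lead, vowel] + rest
    def vjmo(glyphs):
        # Substitute the vowel glyph where the chosen layout requires it (no-op in this case).
        return glyphs
    def liga(glyphs):
        # Replace a fully laid-out jamo run with a precomposed glyph, if one exists.
        return [PRECOMPOSED[tuple(glyphs)]] if tuple(glyphs) in PRECOMPOSED else glyphs
    print(liga(vjmo(ljmo(["uni1100", "uni1161", "uni11B7"]))))   # ['uniAC10']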
But some things to note: if ljmo is not applied, then "uni1100" will not
change to "uni1100.l1" and then liga will not substitute uniAC10, so all
three sequences break. If ccmp is not applied at all, then "uniAC00" will
not change to "uni1100 uni1161", none of the subsequent tables will match,
and sequence 2 breaks. If ccmp is applied, but is not applied FIRST, then
there again the other tables will not see the glyph sequences they're
expecting, and again sequence 2 breaks. If liga is not applied, then
(assuming everything else happens as expected) we end up with "uni1100.l1
uni1161 uni11B7" - typesetting the syllable in "layout 1" as if there were
no precomposed glyph. That will look acceptable, but not as good as the
precomposed glyph would, because the precomposed glyph has a more finely
adjusted layout.
My code point sequences 4 and 5 don't describe a full syllable, but if one
constructs a full syllable by adding one or more vowel and possibly tail
jamo, it will go through a similar process minus the precomposed-syllable
substitution at the end, because I have no precomposed syllables starting
with "pieup-sios-thieuth". If ccmp runs and runs first, the result of the
whole process should look okay. If ccmp does not run, then sequence 5
will result in good typesetting and sequence 4 won't; if ccmp runs but
does not run first, then sequence 5 may also end up incorrect depending on
the other jamo in the syllable.
The scheme above does everything I want it to do, with the versions of the
software I'm currently using. With all due respect, it looks like you're
about to change HarfBuzz so that my fonts will no longer work, to tell me
that it's my own fault because I was doing it wrong all along, and to
suggest a way for me to redesign my fonts at considerable effort that will,
by design, not correctly handle all the cases the old one could correctly
handle. This doesn't sound good to me, and I hope a better resolution is
possible.
On Thu Jan 23, Jonathan Kew writes:
So I think this is a font error. The font is using ccmp to decompose the
syllable AC00 into L and V jamos, but then expecting the shaper to apply
*jmo features to the resulting glyphs. That doesn't work, because
That is (as far as it goes) a correct description of what I expected the
shaper to do. It's also what current XeTeX [using an older HarfBuzz], older
XeTeX [using ICU], and FontForge [using its own code] all seem to do if the
appropriate features are turned on. It's not clear whether the need to turn
the appropriate features on is because those pieces of software don't
support Korean at all, or because they do support Korean and are correctly
not invoking the features under some rule I've been unaware of. Until now I
always thought it was because of a complete absence of support.
Microsoft's documentation on ccmp at
https://www.microsoft.com/typography/otspec/features_ae.htm#ccmp
says:
# Tag: “ccmp”
#
# Friendly name: Glyph Composition/Decomposition
#
# Registered by: Microsoft
#
# Function: To minimize the number of glyph alternates, it is sometimes
# desired to decompose a character into two glyphs. Additionally, it may be
# preferable to compose two characters into a single glyph for better glyph
# processing. This feature permits such composition/decomposition. The feature
# should be processed as the first feature processed, and should be processed
# only when it is called.
#
# Example: In Syriac, the character 0x0732 is a combining mark that has a dot
# above AND a dot below the base character. To avoid multiple glyph variants
# to fit all base glyphs, the character is decomposed into two glyphs...a dot
# above and a dot below. These two glyphs can then be correctly placed using
# GPOS. In Arabic it might be preferred to combine the shadda with fatha
# (0x0651, 0x064E) into a ligature before processing shapes. This allows the
# font vendor to do special handling of the mark combination when doing
# further processing without requiring larger contextual rules.
#
# Recommended implementation: The ccmp table maps the character sequence to
# its corresponding ligature (GSUB lookup type 4) or string of glyphs (GSUB
# lookup type 2). When using GSUB lookup type 4, sequences that are made up of
# larger number of glyphs must be placed before those that require fewer
# glyphs.
#
# Application interface: For GIDs found in the ccmp coverage table, the
# application passes the sequence of GIDs to the table, and gets back the GID
# for the ligature, or GIDs for the multiple substitution.
#
# UI suggestion: This feature should be on by default.
#
# Script/language sensitivity: None.
#
# Feature interaction: This feature needs to be implemented prior to any other
# feature.
Note that it's not specific to any particular language, it's described as
something that should always run, and it's described as running before any
other feature. Adobe's version of the specification says pretty much the
same thing. Microsoft's language-specific documentation for Korean at
https://www.microsoft.com/typography/OpenTypeDev/hangul/intro.htm
also repeatedly describes ccmp as running before *jmo features, although it
also uses language like "Apply feature 'ccmp' to preprocess any glyphs that
require composition" which seems to imply that ccmp might not always run.
It does not mention any possibility of the *jmo features not running.
It's because of these documents, together with checking against XeTeX and FontForge,
that I've written the Jieubsida substitution features the way I have. It
sounds like HarfBuzz's intended architecture works something like
this, which is significantly different from the "always run ccmp, ljmo,
vjmo, and liga, in that order" behaviour that my code currently expects:
* Some sort of composition or decomposition is applied at the level of
code points (not glyphs) to find syllable boundaries. This operation
is not intended to handle sequences of single jamo joining to form
compound jamo such as my sequence 4 above. The mapping at this stage
is part of the "shaper" and not specified by the font.
* The code points, and recognized syllables, are translated to glyphs by
cmap. If precomposed glyphs exist, they are used directly; otherwise
the glyph stream consists of L, V, T triples (T allowed to be null),
with the expectation that clusters (more than one jamo in a single
L/V/T slot) were already combined in the input.
Yes (or a precomposed LV glyph may be used, if there was no following T
with which the L and V may need to interact).
At this stage, individual L, V and T glyphs are tagged with the
appropriate *jmo feature that is to be applied. Precomposed (LV, LVT)
glyphs do not get any of the *jmo features.
* It is not clear to me whether the ccmp table is applied unconditionally
at this point, nor what the conditions for it are if it's conditional.
ccmp is applied unconditionally to all the glyphs (but remember that
canonical composition or decomposition may have occurred already at the
character level).
Note that if ccmp composes or decomposes glyphs, this will *not* affect
which *jmo features are going to be applied; that was already decided by
the shaper based on its analysis above. (The normal expectation is that
a Hangul font should not actually have any need for ccmp.)
* Conditional on some assessment of the structure of the syllable
(perhaps the existence of a precomposed glyph?) the *jmo features may
be applied - presumably to the output of ccmp, if it was applied.
Yes - remembering that the decision as to which *jmo feature, if any,
applies to a given glyph was made *before* ccmp, and knows nothing about
any changes that happened there.
* It is not clear to me under what circumstances liga may be applied.
liga is always applied (although a Hangul font wouldn't usually be
expected to need it). Also, note that liga is intended to be under user
control; although it's enabled by default, authors may turn it off
(directly, or as a side-effect of other styling). You probably don't
want your basic Hangul support to break when ligatures are disabled.
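To sketch the character-level step in rough Python terms (this is the shaper's own logic, before any of the font's lookups run; has_glyph here is just a stand-in for the shaper's check of the font's cmap):
    SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    LCOUNT, VCOUNT, TCOUNT = 19, 21, 28
    def compose_syllable(l, v, t=None):
        # Compose modern <L, V[, T]> jamo into a precomposed syllable code point,
        # or return None if no precomposed character covers them.
        if not (LBASE <= l < LBASE + LCOUNT and VBASE <= v < VBASE + VCOUNT):
            return None
        s = SBASE + ((l - LBASE) * VCOUNT + (v - VBASE)) * TCOUNT
        if t is None:
            return s
        if not (TBASE < t < TBASE + TCOUNT):
            return None
        return s + (t - TBASE)
    def shape_syllable(l, v, t, has_glyph):
        # The shaper's decision in outline: use the precomposed character when the font
        # supports it (no *jmo features applied); otherwise leave the individual jamo
        # in place and tag them with ljmo/vjmo/tjmo.
        s = compose_syllable(l, v, t)
        if s is not None and has_glyph(s):
            return [s]
        return [l, v] + ([t] if t is not None else [])
    print(hex(compose_syllable(0x1100, 0x1161, 0x11B7)))   # 0xac10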
So my first real questions are: what exactly does HarfBuzz intend to do?
Is the above description correct as far as it goes, and if not, what would
be a correct description? What are the answers to the unknown points?
What processing happens before code points change into glyphs? Under what
circumstances will ccmp be applied to the glyph stream? Under what
circumstances will *jmo be applied, and will the input to *jmo be the output
of ccmp (should it be applied) or something else? Under what circumstances
will liga be applied?
On a meta-level: where (or if) HarfBuzz's intended design differs from what
I think the standards require (such points as "ccmp always runs, and is
always first"), am I reading the wrong standards? Is HarfBuzz's behaviour
based on an authority like a standard, stronger than the observed behaviour
of other software such as Uniscribe? Or if it's based on the observed
behaviour of other software, which other software and why? Are these points
documented anywhere?
There's the Hangul shaping document at
http://www.microsoft.com/typography/OpenTypeDev/hangul/intro.htm#features,
but it's unclear and outdated in various respects.
In particular, it does not explicitly state whether the *jmo features
are applied globally, or only to glyphs that the shaper identified as
being in the correct place within a valid syllable. I believe the
intended meaning (and observed Uniscribe behavior) is that these
features are *selectively* applied to the individual glyphs only when
they are found in an <L, V [, T]> sequence.
The ICU implementation, at least (and perhaps old HarfBuzz?), applied
the *jmo features to L, V and T glyphs in a more general sequence of the
form <L+, V+, T*>. This is why a "spelled-out" form such as your (4)
above would have worked there; the ljmo feature was applied to all three
L characters. However, it also means (AIUI) that given a sequence of 4, 5,
or even more Ls in succession, the ljmo feature will be applied even to
those that cannot be part of a valid syllable and would be better left in
their original form.
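A rough way to picture the difference is to classify each character as L, V or T (using the Unicode jamo blocks) and compare the runs each approach will match; the regular expressions below are purely an illustration of the two behaviours, not anyone's actual implementation:
    import re
    def jamo_class(ch):
        # Classify a character as L (lead), V (vowel) or T (tail) jamo, "." otherwise.
        cp = ord(ch)
        if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C: return "L"
        if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6: return "V"
        if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB: return "T"
        return "."
    text = "\u1107\u1109\u1110\u1161\u11B7"        # spelled-out lead cluster + vowel + tail
    run = "".join(jamo_class(c) for c in text)     # "LLLVT"
    print(bool(re.fullmatch(r"LVT?", run)))        # False: strict <L, V [, T]> does not match
    print(bool(re.fullmatch(r"L+V+T*", run)))      # True: relaxed <L+, V+, T*> matches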
I would much prefer to have a clear description of what HarfBuzz is trying
to do and why, over advice on what Mandeubsida should do. I don't expect
HarfBuzz's developers to alter their design to match what I think it should
be, not even if I think the standards may mandate such an alteration, and
I'm wary of altering my own design to suit a third-party package in
preference to my own reading of the standards. Nonetheless, it sounds like
HarfBuzz developers do have some ideas regarding what I ought to do, and
since I want my fonts to work with HarfBuzz, those ideas are worth
thinking about.
On Thu Jan 23, Jonathan Kew writes:
So the font is using the wrong strategy. It should be simplified to
remove the syllable decompositions from ccmp; that's handled by the
shaper itself. (And it doesn't need the liga feature to reassemble the
original syllables, either, as the shaper won't decompose them unless
actually necessary, e.g. to support an <LV, T> sequence.)
If I'm understanding HarfBuzz's intended operation and this description
correctly, my sequence 3 (a single precomposed syllable) will be recognized
as a precomposed syllable, NOT decomposed, and will go directly through to
the precomposed glyph; that's fine.
Yes.
Sequences 2 (precomposed syllable plus
a tail) and 1 (separate lead, vowel, and tail, one of each) will be
recognized by the shaper (not by ccmp or liga) as adding up to a precomposed
syllable. It's not clear to me whether then HarfBuzz will attempt to run
them through the *jmo features, but my guess is not - instead it will go
directly to the uniAC10 precomposed glyph. That's good too.
Right. Provided the precomposed character is supported by the font, it
will be used (and no *jmo features applied).
So far it
sounds like I can get the desired behaviour just by removing the ccmp table,
and the recombination mappings from the liga table. Less code needed from
me, still correct results, that's great.
I believe so, yes.
With sequence 5 (a cluster of lead jamo expressed as a single code point),
the desired behaviour is one glyph each for the cluster lead, the vowel, and
the tail if any, with the lead and vowel substituted in a context-sensitive
way depending on the shape of the vowel and presence or absence of a tail.
That appears to be the case in which HarfBuzz will invoke *jmo features to
choose the right context-sensitive glyphs; but it's not clear to me exactly
what the input to these features will look like. Presumably with
documentation or experiments, I can figure that out. I may be lucky enough
to find that the current substitution tables will work unmodified.
The L, V and T jamos will each be mapped to its default glyph via the
cmap, and the respective ljmo, vjmo and tjmo features will be applied to
those.
Except that if you had a ccmp that broke the complex lead jamo into
three separate glyphs, that will presumably have been applied already. I
think the ljmo feature would then get applied to all three of the simple
L glyphs, though I haven't double-checked this.
With sequence 4 (multiple lead jamo expressed as single jamo code points,
resulting in a single glyph for the cluster, chosen context-sensitively) it
appears that HarfBuzz is not intended to support that case, and the strategy
described above should not be expected to produce correct results with this
code point sequence.
Right; this sequence is not currently intended to be supported. As
discussed above, I am not convinced supporting this is a good thing
overall, because of its non-equivalence to sequence (5), its
incompatibility with Windows behavior, and its invalidity according to
the relevant Korean standard.
Note, also, that making the changes necessary to get
correct behaviour from the new HarfBuzz in the more common cases will
apparently result in fonts that do not work on software (including earlier
versions of HarfBuzz) where the current Jieubsida fonts do work, even in the
more common cases. These points are issues for me.
I believe you could make the fonts continue to work (in both old and new
HarfBuzz, ICU, etc) by simply moving *all* your lookups into the ccmp
feature, and ignoring the Hangul-specific *jmo features altogether. Then
(AIUI) they'd be applied to all the text, just as you expected, and it
would be entirely up to your (context-sensitive) lookups to decompose,
choose forms, recompose, etc., as desired.
However, I don't actually recommend doing this; I think it's better for
the long-term interests of the Korean user community, Korean data on the
Web and elsewhere, etc., for everyone to conform to the current
recommendation - as enshrined in Korean standards and implemented in
Windows - that sequence (5) should be used, and not (4).
On Thu Jan 23, Jonathan Kew also writes:
The font should *not* use the generic ccmp feature to
decompose it, unless it intends to do *everything* using generic global
features, not the hangul-specific features.
Doing everything using generic global features may in fact be the best
solution for me. Inasmuch as an OpenType contextual substitution table is a
finite-state transducer and such things are closed under composition, I can
reduce the current sequence of four tables, all of which I want applied every
time, to fewer than four, perhaps even a single table; that table may grow
large, but I can generate it algorithmically.
If I go this route, defining no *jmo tables, can I depend on ccmp and liga
always being applied and always in that order?
Currently, at least in harfbuzz, ccmp and liga (and the *jmo features,
when used) are all applied "together", with the order of lookups being
their order in the font. This is the generic standard OpenType behavior
(see "Features and Lookups", in
http://www.microsoft.com/typography/otspec/chapter2.htm), and gives you
as font designer control over how the lookups interact. Some shapers
override this, and apply features individually (or in smaller groups),
but we try to avoid doing so unless required for compatibility with
Uniscribe behavior.
So yes, you can depend on ccmp being applied. You shouldn't actually be
depending on liga for any of this, because it may be disabled due to
user styling - e.g. when letter-spacing is used in Firefox, at least,
liga is disabled - that would not normally be expected to break basic
script rendering.
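If you want to check, a quick test along these lines with the uharfbuzz Python bindings (the font path here is just a placeholder) will show whether the glyph stream still changes when liga is turned off:
    import uharfbuzz as hb
    def glyph_ids(font, text, features):
        # Shape `text` with the given feature settings and return the glyph IDs.
        buf = hb.Buffer()
        buf.add_str(text)
        buf.guess_segment_properties()
        hb.shape(font, buf, features)
        return [info.codepoint for info in buf.glyph_infos]
    blob = hb.Blob(open("Jieubsida.otf", "rb").read())   # placeholder font path
    font = hb.Font(hb.Face(blob))
    text = "\uAC00\u11B7"                                # sequence (2): GA + jongseong mieum
    print(glyph_ids(font, text, {}) == glyph_ids(font, text, {"liga": False}))
    # True once rendering no longer depends on liga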
Is there some longer
sequence of global tables I can depend on always being applied and always in
a specific order?
Remember that you can have a whole sequence of lookups within a single
feature; you don't need multiple features to achieve this.
Will the "shaper", even in the absence of *jmo tables,
perform some translations on the sequence of code points that I need to know
about in building my substitution table(s)?
Yes; as described earlier, it will replace <L, V [, T]> and <LV, T>
sequences with precomposed syllables where possible; and it will also
decompose <LV, T> to <L, V, T> if a suitable <LVT> does not exist.
However, I don't think this should matter to you, as your tables are
presumably designed to support these equivalents anyway.
Ever since attending Jin-Hwan Cho's talk at TUG 2013, it's been on my to-do
list to take a close look at Dohyun Kim's work in the HCR fonts. Maybe now
is a good time for me to do that. I think the HCR fonts have a very
different architecture from mine, because they use no precomposed syllables
and many more on-the-fly layouts and jamo variants. (I don't know if I
clearly addressed a question from Jin-Hwan Cho in our discussions at the
conference: my fonts have at most five variants of each jamo, far fewer than
HCR, *but* I only use those variants at all when there's no precomposed
syllable. The number of variants built into the precomposed syllables is
far greater.) Presumably the HCR fonts have to solve similar problems to
mine of interacting predictably with the "shaper" and working well on a wide
range of software, so their solutions may be useful.
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz