Re: [HarfBuzz] 'vert' substitutions in CJK fonts

2013-02-06 Thread Behdad Esfahbod
On 13-02-04 12:34 PM, Grigori Goronzy wrote:
 
 The GSUB table of problematic fonts typically looks a bit too...
 minimal. Here's an example, that's the MOTOYA LMaru font, Android's
 standard CJK font:
...
 So substitutions are only used if the run that is shaped has Katakana
 (kana) script and language is set to Japanese (JAN). It works if I
 explicitly set the language and force the script to Katakana.
 
 But, in practice, that's of course not true! First, it breaks as soon as
 the system language is not Japanese, unless the language has been
 overridden. Second, not only Katakana characters have vertical variants.
 Punctuation might or might not be substituted depending on context,
 because punctuation characters have common script and assume the script
 of characters around them. If they're next to Kanji characters, it will
 break.

Grigori, welcome to the darker sides of text rendering :).

This is what Pango does, and what I eventually want to make easier to do with
HarfBuzz:

  - Say the system language is en.  Upon detecting Katakana, Pango proceeds to
resolve the language to assign to that run of text.  Pango knows which scripts
each language tag (locale) uses, so it correctly detects that English doesn't
use Katakana and that this run therefore can't be in English.  It then goes
searching for a better language tag for the run:

* If env vars $LANGUAGE and/or $PANGO_LANGUAGE are set, it looks in the
languages listed there (those are each a list of language tags), and picks the
first one that uses Katakana,

* If that fails, it knows that the most likely language tag for Katakana is
ja, so it uses that.

This is really useful.  For example, by default when Pango sees text in Arabic
script, it behaves as if it's in the Arabic language.  But if I set
LANGUAGE=en,fa in my system, then Pango will attribute untagged Arabic script
text to Persian instead of Arabic.

This is all in pango-language.c.  Check it out.
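
The resolution steps described above can be sketched roughly as follows. This
is only an illustration, not Pango's code: the two tables are tiny hypothetical
stand-ins for Pango's real data, resolve_language() is a made-up name, checking
$PANGO_LANGUAGE before $LANGUAGE is an assumption, and both ":" and "," are
accepted as list separators because the example above writes LANGUAGE=en,fa.

```python
import os

# Hypothetical, heavily abridged data: which scripts each language uses, and
# the most likely language for each script. Real tables are far larger.
LANGUAGE_SCRIPTS = {
    "en": {"Latin"},
    "ja": {"Hiragana", "Katakana", "Han"},
    "ar": {"Arabic"},
    "fa": {"Arabic"},
}
DEFAULT_LANGUAGE_FOR_SCRIPT = {
    "Latin": "en",
    "Hiragana": "ja",
    "Katakana": "ja",
    "Arabic": "ar",
}


def resolve_language(script, system_language, environ=None):
    """Pick a language tag for an untagged run written in `script`."""
    environ = os.environ if environ is None else environ

    # 1. Keep the system language if it can be written in this script.
    if script in LANGUAGE_SCRIPTS.get(system_language, set()):
        return system_language

    # 2. Otherwise take the first preferred language that uses the script.
    #    (The order of the two variables is an assumption of this sketch.)
    for var in ("PANGO_LANGUAGE", "LANGUAGE"):
        tags = environ.get(var, "").replace(",", ":").split(":")
        for tag in filter(None, tags):
            if script in LANGUAGE_SCRIPTS.get(tag, set()):
                return tag

    # 3. Fall back to the most likely language for the script (may be None).
    return DEFAULT_LANGUAGE_FOR_SCRIPT.get(script)
```

With these toy tables, a Katakana run under an "en" system language resolves
to "ja", and an Arabic run resolves to "fa" when LANGUAGE=en,fa is set but to
"ar" when it is not.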

The other problem you point out is also handled by resolving
Script=Common/Inherited characters to their neighboring scripts.  So in this
case, even the punctuation will be marked 'kana'.
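
A minimal sketch of that kind of resolution, assuming a toy classifier:
script_of() below only looks at a few code point ranges and treats whole
Unicode blocks as one script, which is an approximation. Real code should use
proper Unicode Script property data. Both function names are hypothetical.

```python
def script_of(ch):
    """Toy classifier covering only the ranges this example needs."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"  # punctuation, symbols, everything else in this sketch


def resolve_scripts(text):
    """Return one script per character, with Common resolved to a neighbor."""
    scripts = [script_of(ch) for ch in text]

    # Forward pass: a Common character takes the script of what precedes it.
    for i in range(1, len(scripts)):
        if scripts[i] == "Common":
            scripts[i] = scripts[i - 1]

    # Backward pass: leading Common characters take the script that follows.
    for i in range(len(scripts) - 2, -1, -1):
        if scripts[i] == "Common":
            scripts[i] = scripts[i + 1]

    return scripts
```

In this sketch the ideographic full stop after a Katakana word is resolved to
Katakana, the same full stop after Kanji is resolved to Han, and a string
consisting only of punctuation stays Common because it has no neighbor to
inherit from.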

That said, it is a known shortcoming of OpenType, that a lone punctuation
character cannot hit any script tables other than DFLT...


 Should fonts with GSUB tables like that be considered broken?

Yes, such a font should define a default language system.  But then, many
fonts in common use are broken one way or another...


 What does Uniscribe do to make this work?

Don't know.

 And lastly, can I force HarfBuzz to just
 use the first 'vert' substitution lookup in case there's none to be
 found with matching or DFLT script/language system?

Not really / easily.  You can use the hb-ot.h API to detect that, and find the
OT LangSys tag that *does* have the substitution, then use
hb_ot_tag_to_language() to get a language tag that when passed back to
HarfBuzz, will choose that substitution.
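
To illustrate the idea without depending on HarfBuzz, here is a rough sketch
that models the GSUB table from the quoted dump as a plain dictionary. It calls
no HarfBuzz function: select_features() is a simplified stand-in for the usual
script/language-system selection, find_langsys_with_feature() is a hypothetical
helper, and "dflt" is just this sketch's key for a script's default language
system. In a real program the enumeration would go through the hb-ot.h API, and
the tag found would be turned into a language with hb_ot_tag_to_language().

```python
# Model of the quoted dump: script tag -> language system tag -> feature tags.
# The font has one script (kana), no default language system, and one
# language system (JAN) that carries 'vert'.
GSUB = {"kana": {"JAN ": ["vert"]}}


def select_features(gsub, script, langsys):
    """Simplified selection: requested script, else DFLT; then the requested
    language system, else the script's default one (key "dflt" here)."""
    systems = gsub.get(script, gsub.get("DFLT", {}))
    return systems.get(langsys, systems.get("dflt", []))


def find_langsys_with_feature(gsub, feature):
    """Return the first (script, langsys) pair that lists `feature`, or None."""
    for script, systems in gsub.items():
        for langsys, features in systems.items():
            if feature in features:
                return script, langsys
    return None


# A Katakana run tagged as English finds nothing, and neither does a Han run.
assert select_features(GSUB, "kana", "ENG ") == []
assert select_features(GSUB, "hani", "JAN ") == []

# Fallback: locate a language system that does have 'vert' and select it.
found = find_langsys_with_feature(GSUB, "vert")
if found is not None:
    assert "vert" in select_features(GSUB, *found)
```

The fallback only picks the first match in table order, so with a font that
defines 'vert' under several language systems the choice is arbitrary.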

Cheers,

-- 
behdad
http://behdad.org/
___
HarfBuzz mailing list
HarfBuzz@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/harfbuzz


Re: [HarfBuzz] 'vert' substitutions in CJK fonts

2013-02-04 Thread suzuki toshiya
Hi,

Thank you for opening the interesting discussion.
I think the script names in the OpenType spec are not identical to the
block names in Unicode; kana does not denote only the small group of katakana
and hiragana, but also the larger group including katakana, hiragana,
CJK ideographs, CJK punctuation, CJK symbols, etc.

When I worked on poppler (a PDF rendering library), I ran into a similar problem:
http://lists.freedesktop.org/archives/poppler/2012-March/008860.html
I should note that the default language system strategy would not work
well with (old versions of?) the Batang font (a Korean font bundled with
Microsoft Windows).

The question there was how the OpenType layout features should be configured
when vertical text is requested without an embedded font. I used the
combinations CHN/hani for Simplified or Traditional Chinese, JAN/kana for
Japanese, and KOR/hang for Korean. But that was designed to fit poppler's
internal design; real i18n software would need more comprehensive consideration.
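
As a rough illustration, that fixed mapping could be written as a small lookup
table. This is a sketch, not poppler's code: the tag pairs are the ones listed
above, while keying the table by primary language subtag ("zh", "ja", "ko") and
the name vertical_layout_tags() are assumptions made for the example.

```python
# (language system tag, script tag) pairs as listed in this message.
# Tags are written unpadded; OpenType pads tags to four bytes with spaces.
VERTICAL_TAGS = {
    "zh": ("CHN", "hani"),  # Simplified or Traditional Chinese
    "ja": ("JAN", "kana"),  # Japanese
    "ko": ("KOR", "hang"),  # Korean
}


def vertical_layout_tags(language):
    """Return (langsys, script) for a known CJK language tag, else None."""
    primary = language.lower().replace("_", "-").split("-")[0]
    return VERTICAL_TAGS.get(primary)
```

A caller that gets None back has no fixed combination to apply, which is the
case for plain Unicode text whose language is unknown.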

Regards,
mpsuzuki

Grigori Goronzy wrote:
 Hi,
 
 a user of my library reported that vertical alternates are not correctly
 substituted in many CJK fonts. I am a bit puzzled by this.
 
 When doing vertical layout of Japanese text, the 'vert' feature is
 enabled in the library to select vertical variants of some Kanji, Kana
 and punctuation characters. This works fine with many fonts, but with
 some it does not.
 
 The GSUB table of problematic fonts typically looks a bit too...
 minimal. Here's an example, that's the MOTOYA LMaru font, Android's
 standard CJK font:
 
   Table  0 of 16: GSUB (0x010c+0x016c)
     1 script(s) found in table
     Script  0 of  1: kana
       No default language system
       1 language system(s) found in script
       Language System  0 of  1: JAN 
         No required feature
         1 feature(s) found in language system
         Feature index  0 of  1: 0
     1 feature(s) found in table
     Feature  0 of  1: vert; 1 lookup(s)
       1 lookup(s) found in feature
       Lookup index  0 of  1: 0
     1 lookup(s) found in table
     Lookup  0 of  1: type 1, props 0x0001
 
 So substitutions are only used if the run that is shaped has Katakana
 (kana) script and language is set to Japanese (JAN). It works if I
 explicitly set the language and force the script to Katakana.
 
 But, in practice, that's of course not true! First, it breaks as soon as
 the system language is not Japanese, unless the language has been
 overridden. Second, not only Katakana characters have vertical variants.
 Punctuation might or might not be substituted depending on context,
 because punctuation characters have common script and assume the script
 of characters around them. If they're next to Kanji characters, it will
 break.
 
 Should fonts with GSUB tables like that be considered broken? What does
 Uniscribe do to make this work? And lastly, can I force HarfBuzz to just
 use the first 'vert' substitution lookup in case there's none to be
 found with matching or DFLT script/language system?
 
 Best regards
 Grigori


Re: [HarfBuzz] 'vert' substitutions in CJK fonts

2013-02-04 Thread Grigori Goronzy
On 02/05/2013 01:50 AM, suzuki toshiya wrote:
 Hi,
 
 Thank you for opening the interesting discussion.
 I think the script names in the OpenType spec are not identical to the
 block names in Unicode; kana does not denote only the small group of katakana
 and hiragana, but also the larger group including katakana, hiragana,
 CJK ideographs, CJK punctuation, CJK symbols, etc.


They are not identical to Unicode, but kana indeed means just Katakana
and Hiragana in OpenType, at least according to the specification:

https://www.microsoft.com/typography/otspec/scripttags.htm

The lack of detail in the OpenType specification is really bad... in
this case it just says it's not always similar to Unicode but doesn't
explain how it differs from it either. :(

 When I worked on poppler (a PDF rendering library), I ran into a similar problem:
 http://lists.freedesktop.org/archives/poppler/2012-March/008860.html
 I should note that the default language system strategy would not work
 well with (old versions of?) the Batang font (a Korean font bundled with
 Microsoft Windows).


Hmm, interesting... but lack of language-specific matching is not the
problem here.

 The question there was how the OpenType layout features should be configured
 when vertical text is requested without an embedded font. I used the
 combinations CHN/hani for Simplified or Traditional Chinese, JAN/kana for
 Japanese, and KOR/hang for Korean. But that was designed to fit poppler's
 internal design; real i18n software would need more comprehensive consideration.


So you use a fixed script for a given language? I don't know, but this
seems to be quite hacky. Often you don't even know what language you're
going to display. This might work in poppler's case, but in my case
(render some line of Unicode text with arbitrary languages) it does not.

Best regards
Grigori


Re: [HarfBuzz] 'vert' substitutions in CJK fonts

2013-02-04 Thread suzuki toshiya
Hi,

Grigori Goronzy wrote:
 On 02/05/2013 01:50 AM, suzuki toshiya wrote:
 Hi,

 Thank you for opening the interesting discussion.
 I think the script names in the OpenType spec are not identical to the
 block names in Unicode; kana does not denote only the small group of katakana
 and hiragana, but also the larger group including katakana, hiragana,
 CJK ideographs, CJK punctuation, CJK symbols, etc.

 
 They are not identical to Unicode, but kana indeed means just Katakana
 and Hiragana in OpenType, at least according to the specification:
 
 https://www.microsoft.com/typography/otspec/scripttags.htm

Oh, I had overlooked that kana appears twice for Hiragana and Katakana.

 The lack of detail in the OpenType specification is really bad... in
 this case it just says it's not always similar to Unicode but doesn't
 explain how it differs from it either. :(

Indeed. I should ask the SC29/WG11 font AHG people about the possibility
of further clarification.

 When I worked on poppler (a PDF rendering library), I ran into a similar problem:
 http://lists.freedesktop.org/archives/poppler/2012-March/008860.html
 I should note that the default language system strategy would not work
 well with (old versions of?) the Batang font (a Korean font bundled with
 Microsoft Windows).

 
 Hmm, interesting... but lack of language-specific matching is not the
 problem here.

I'm sorry, you are right. In PDF, the font references for non-embedded
CID-keyed fonts often provide additional information about the script,
so that method cannot be applied to the rendering of plain
Unicode text.

 The question there was how the OpenType layout features should be configured
 when vertical text is requested without an embedded font. I used the
 combinations CHN/hani for Simplified or Traditional Chinese, JAN/kana for
 Japanese, and KOR/hang for Korean. But that was designed to fit poppler's
 internal design; real i18n software would need more comprehensive consideration.

 
 So you use a fixed script for a given language? I don't know, but this
 seems to be quite hacky. Often you don't even know what language you're
 going to display. This might work in poppler's case, but in my case
 (render some line of Unicode text with arbitrary languages) it does not.

Indeed, it's not easy to guess the language from plain Unicode text.

As I understand it, the problem in your first post was that the OpenType script
tag is ambiguous or (if the spec is read precisely) too narrow in practice to
cover the Unicode characters that the feature should control. Is that correct?

Regards,
mpsuzuki