On 14-01-09 01:55 AM, Ariel Malka wrote:
>
> https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector
This is awesome!  Thank you.

behdad

> Feedback is welcome,
> Ariel
>
> P.S. The next step is to mix script/lang items with BIDI items (the Mapnik
> project should be very helpful here...)
>
> On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <[email protected]> wrote:
>
> > On 13-12-22 08:51 PM, Ariel Malka wrote:
> > > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> > >
> > > An attempt to recap using my original Japanese example:
> > >
> > > ユニコードは、すべての文字に固有の番号を付与します
> > >
> > > ICU's "scrptrun" detects Katakana, Hiragana and Han scripts.
> > >
> > > Case 1: no "input list of languages" is provided.
> > >
> > > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > > of http://goo.gl/mpD9Fg
> > > In turn, MTLmr3m.ttf (the default for "ja" on my system) will be used.
> > > So far so good.
> > >
> > > b) For Han items, no language will be selected, because of http://goo.gl/xusqwn
> > > At this stage we still need to pick a font, so I guess we choose
> > > DroidSansFallback.ttf (the default for Han on my system), unless...
> > >
> > > Some additional strategy could be used, like observing the surrounding items?
> >
> > Yes.  All itemization issues can use surrounding context when in doubt...
> > It's just about managing complexity...
> >
> > > Case 2: we use "ja" (say, collected from the locale) as the "input language".
> > >
> > > For all the items, "ja" will be selected, because the 3 scripts are valid
> > > for writing this language, as defined in http://goo.gl/hwQri5
> > >
> > > By the way, I wonder why Korean does not include Han
> > > (see http://goo.gl/bI5BLj), in contradiction with the explanation
> > > in http://goo.gl/xusqwn?
> >
> > Great point.  The way the script-per-language table was put together is
> > using fontconfig's orth files, which basically only list Hangul characters
> > for Korean.  It can definitely be improved upon, and I'm willing to hear
> > from Roozbeh and others whether we have better data somewhere.
> >
> > behdad
> >
> > > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <[email protected]> wrote:
> > >
> > > > On 13-12-22 06:17 PM, Ariel Malka wrote:
> > > > > > As it happens, those three scripts are all considered "simple", so
> > > > > > the shaping logic in HarfBuzz is the same for all three.
> > > > >
> > > > > Good to know.  For the record, there's a function for checking whether
> > > > > a script is complex in the recent HarfBuzz-flavored Android OS:
> > > > > http://goo.gl/KL1KUi
> > > >
> > > > Please NEVER use something like that.  It's broken by design.  It exists
> > > > in Android for legacy reasons, and will eventually be removed.
> > > >
> > > > > > Where it does make a difference is if the font has ligatures,
> > > > > > kerning, etc. for those.  OpenType organizes those features by
> > > > > > script, and if you request the wrong script you will miss out on
> > > > > > the features.
> > > > >
> > > > > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was a bit
> > > > > surprised to find out that LATN was also a complex script.
> > > >
> > > > LATN uses the "generic" shaper, so it's not complex, no.
> > > >
> > > > > So, for instance, if I were to shape some text containing Hebrew and
> > > > > English solely using the HEBR script, I would probably lose kerning
> > > > > and ffi-like ligatures for the English part
> > > >
> > > > Correct.
> > > >
> > > > > (this is what I'm actually doing currently in my "simple" BIDI
> > > > > implementation...)
> > > >
> > > > Then fix it.  BIDI and script itemization are two separate issues.
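For reference, the script-itemization half of that is a small amount of code on
its own.  Below is a minimal sketch using HarfBuzz's own Unicode functions; the
ScriptRun struct is just illustrative, and real itemizers such as ICU's
"scrptrun" also handle paired punctuation and are smarter about merging
COMMON/INHERITED characters into neighbouring runs:

    // Minimal script itemization: split UTF-32 text into runs of equal script.
    // Each run can then be shaped separately, with its own hb_script_t.
    #include <hb.h>
    #include <string>
    #include <vector>

    struct ScriptRun { size_t start, length; hb_script_t script; };

    static std::vector<ScriptRun> itemize_scripts(const std::u32string &text)
    {
        hb_unicode_funcs_t *ufuncs = hb_unicode_funcs_get_default();
        std::vector<ScriptRun> runs;

        for (size_t i = 0; i < text.size(); i++)
        {
            hb_script_t script = hb_unicode_script(ufuncs, text[i]);

            // Naive handling: glue COMMON/INHERITED characters to the run in progress.
            if (script == HB_SCRIPT_COMMON || script == HB_SCRIPT_INHERITED)
                script = runs.empty() ? HB_SCRIPT_COMMON : runs.back().script;

            if (!runs.empty() && runs.back().script == script)
                runs.back().length++;
            else
                runs.push_back(ScriptRun { i, 1, script });
        }

        return runs;
    }

Run on the Japanese sample from the recap above, this should yield a Katakana
run followed by alternating Hiragana and Han runs, with the punctuation glued
to the run that precedes it.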
> > > > > > How you do font selection and what script you pass to HarfBuzz are
> > > > > > two completely separate issues.  Font fallback stack should be
> > > > > > per-language.
> > > > >
> > > > > I understand that the best scenario will always be to take decisions
> > > > > based on "language" rather than solely on "script", but it creates a
> > > > > problem:
> > > > >
> > > > > Say you work on an API for Unicode text rendering: you can't promise
> > > > > your users a solution where they could use arbitrary text without
> > > > > providing language context per span.
> > > >
> > > > These are very good questions.  And we have answers to all.
> > > > Unfortunately there's no single location with all this information.
> > > > I'm working on documenting them, but it looks like replying to you and
> > > > letting you document it is better.
> > > >
> > > > What Pango does is: it takes an input list of languages (through
> > > > $LANGUAGE, for example), and whenever there's an item of text with
> > > > script X, it assigns a language to the item in this manner:
> > > >
> > > >   - If a language L is set on the item (through xml:lang, or whatever
> > > >     else the user can use to set a language), and script X may be used
> > > >     to write language L, then resolve to language L and return,
> > > >
> > > >   - for each language L in the list of default languages $LANGUAGE, if
> > > >     script X may be used to write language L, then resolve to language
> > > >     L and return,
> > > >
> > > >   - If there's a predominant language L that is likely for script X,
> > > >     resolve to language L and return,
> > > >
> > > >   - Assign no language.
> > > >
> > > > This algorithm needs two tables of data:
> > > >
> > > >   - The list of scripts a language tag may possibly use.  This is, for
> > > >     example, available in pango-script-lang-table.h.  It's generated
> > > >     from fontconfig orth files using pango/tools/gen-script-for-lang.c.
> > > >     Feel free to copy it.
> > > >
> > > >   - The most likely language for each script.  This is available in CLDR:
> > > >
> > > >     http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> > > >
> > > >     Pango has its own manually compiled list in pango-language.c
> > > >
> > > > Again, all these are on my plate for the next library I'm going to
> > > > design.  It will take a while though...
> > > >
> > > > behdad
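In code, that resolution order amounts to roughly the sketch below.  The two
lookup helpers, script_can_write() and likely_language_for_script(), are
hypothetical stand-ins for the data tables mentioned above
(pango-script-lang-table.h and the CLDR likely-subtags list); here they carry
just enough toy data to cover the Japanese example from the recap:

    // Sketch of Pango-style language assignment for one script run.
    // script_can_write() and likely_language_for_script() are hypothetical
    // stand-ins for pango-script-lang-table.h and CLDR likely_subtags.
    #include <hb.h>
    #include <vector>

    // Toy data: Hiragana/Katakana write "ja"; Han writes "ja" and "zh".
    static bool script_can_write(hb_script_t script, hb_language_t lang)
    {
        hb_language_t ja = hb_language_from_string("ja", -1);
        hb_language_t zh = hb_language_from_string("zh", -1);

        if (script == HB_SCRIPT_HIRAGANA || script == HB_SCRIPT_KATAKANA)
            return lang == ja;
        if (script == HB_SCRIPT_HAN)
            return lang == ja || lang == zh;
        return false;
    }

    // Toy data: Hiragana/Katakana are most likely Japanese; Han is ambiguous.
    static hb_language_t likely_language_for_script(hb_script_t script)
    {
        if (script == HB_SCRIPT_HIRAGANA || script == HB_SCRIPT_KATAKANA)
            return hb_language_from_string("ja", -1);
        return HB_LANGUAGE_INVALID;
    }

    static hb_language_t guess_item_language(hb_script_t script,
                                             hb_language_t item_lang,  // e.g. from xml:lang
                                             const std::vector<hb_language_t> &default_langs)  // e.g. $LANGUAGE
    {
        // 1. A language set on the item itself wins, if the script can write it.
        if (item_lang != HB_LANGUAGE_INVALID && script_can_write(script, item_lang))
            return item_lang;

        // 2. Otherwise, the first default language that the script can write.
        for (hb_language_t lang : default_langs)
            if (script_can_write(script, lang))
                return lang;

        // 3. Otherwise, the predominant language for the script, if any.
        hb_language_t likely = likely_language_for_script(script);
        if (likely != HB_LANGUAGE_INVALID)
            return likely;

        // 4. Give up: assign no language.
        return HB_LANGUAGE_INVALID;
    }

With an empty default list this reproduces Case 1 above (Katakana and Hiragana
items resolve to "ja", Han items get no language); with "ja" in the default
list it reproduces Case 2 (every item resolves to "ja").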
> > > > > Or, to come back to the origin of the message: solutions like ICU's
> > > > > "scrptrun", which do script detection, are not appropriate (because
> > > > > they won't help you find the right font, due to the lack of language
> > > > > context...)
> > > > >
> > > > > I guess the problem is even more generic, like with UTF-8-encoded
> > > > > HTML pages rendered in modern browsers, as demonstrated by the
> > > > > creator of liblinebreak: http://wyw.dcweb.cn/lang_utf8.htm
> > > > >
> > > > > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <[email protected]> wrote:
> > > > >
> > > > > > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > > > > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese
> > > > > > > with HarfBuzz.
> > > > > > >
> > > > > > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar
> > > > > > > to what is rendered via browsers.
> > > > > > >
> > > > > > > But after examining other rendering solutions, I can see that
> > > > > > > "automatic script detection" often takes place.
> > > > > > >
> > > > > > > For instance, the Mapnik project is using ICU's "scrptrun", which,
> > > > > > > given the following sentence:
> > > > > > >
> > > > > > > ユニコードは、すべての文字に固有の番号を付与します
> > > > > > >
> > > > > > > would detect a mix of Katakana, Hiragana and Han scripts.
> > > > > > >
> > > > > > > But, for instance, it would not change anything if I rendered the
> > > > > > > sentence by mixing the 3 different scripts (i.e. instead of using
> > > > > > > only HB_SCRIPT_KATAKANA).
> > > > > > >
> > > > > > > Or are there situations where it would make a difference?
> > > > > >
> > > > > > As it happens, those three scripts are all considered "simple", so
> > > > > > the shaping logic in HarfBuzz is the same for all three.  Where it
> > > > > > does make a difference is if the font has ligatures, kerning, etc.
> > > > > > for those.  OpenType organizes those features by script, and if you
> > > > > > request the wrong script you will miss out on the features.
> > > > > >
> > > > > > > I'm asking because I suspect a catch-22 situation here.  For
> > > > > > > example, the word "diameter" in Japanese is 直径 which, given to
> > > > > > > "scrptrun", would be detected as Han script.
> > > > > > >
> > > > > > > As far as I understand, it could be a problem on systems where
> > > > > > > DroidSansFallback.ttf is used, because the word would look like
> > > > > > > Simplified Chinese.
> > > > > > >
> > > > > > > Now, if we were using MTLmr3m.ttf, which is preferred for
> > > > > > > Japanese, the word would have been rendered as intended.
> > > > > >
> > > > > > How you do font selection and what script you pass to HarfBuzz are
> > > > > > two completely separate issues.  Font fallback stack should be
> > > > > > per-language.
> > > > > >
> > > > > > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > > > > > >
> > > > > > > Any feedback would be appreciated.  Note that the wisdom
> > > > > > > accumulated here will be translated into tangible info and code
> > > > > > > samples (see https://github.com/arielm/Unicode)
> > > > > > >
> > > > > > > Thanks!
> > > > > > > Ariel

--
behdad
http://behdad.org/
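To close the loop on the 直径 example: once itemization has settled on Han
script plus the language "ja", both values go onto the HarfBuzz buffer, and the
same language should drive the font choice further up the stack (e.g.
MTLmr3m.ttf rather than DroidSansFallback.ttf).  A minimal sketch against
current HarfBuzz; the font path is only an example, and
hb_blob_create_from_file is a convenience from releases newer than the ones
discussed in this thread:

    // Shape one Han run as Japanese.  The script selects the OpenType shaping
    // rules and feature set; the language additionally enables 'locl' variants
    // in fonts that have them, and should also drive font selection.
    #include <hb.h>
    #include <cstdio>

    int main()
    {
        // Example font path; any Japanese-preferred font would do here.
        hb_blob_t *blob = hb_blob_create_from_file("MTLmr3m.ttf");
        hb_face_t *face = hb_face_create(blob, 0);
        hb_font_t *font = hb_font_create(face);

        hb_buffer_t *buf = hb_buffer_create();
        hb_buffer_add_utf8(buf, "直径", -1, 0, -1);
        hb_buffer_set_direction(buf, HB_DIRECTION_LTR);
        hb_buffer_set_script(buf, HB_SCRIPT_HAN);
        hb_buffer_set_language(buf, hb_language_from_string("ja", -1));

        hb_shape(font, buf, nullptr, 0);

        unsigned int len;
        hb_glyph_info_t *info = hb_buffer_get_glyph_infos(buf, &len);
        for (unsigned int i = 0; i < len; i++)
            printf("glyph index %u\n", info[i].codepoint);

        hb_buffer_destroy(buf);
        hb_font_destroy(font);
        hb_face_destroy(face);
        hb_blob_destroy(blob);
        return 0;
    }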
