Is it too much to expect minority language users to specify the language they are using? Inconveniencing the 99% who are using Thai script to write Thai in order to help the 1% who are using Thai script to write minority languages doesn't seem like a good trade-off.
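For what it's worth, here is a rough sketch (in Python, and not Pango's actual code) of the resolution order Behdad describes further down the thread: an explicitly tagged language wins when it can be written in the item's script, then the user's default languages are tried, and only then does the likely language for the script apply. The function name `resolve_language` and both tables are hypothetical and hand-written for illustration; real data would come from something like pango-script-lang-table.h and the CLDR likely-subtags list. The `kxm` entry is assumed here to stand for a minority language written in Thai script.

```python
from typing import Optional, Sequence

# Hypothetical sample data, for illustration only. Real tables would be
# generated (e.g. from fontconfig orth files or CLDR), not typed by hand.
SCRIPTS_FOR_LANG = {
    "th": {"Thai"},
    "kxm": {"Thai"},  # assumed: a minority language written in Thai script
    "ja": {"Hiragana", "Katakana", "Han"},
    "zh": {"Han"},
    "ko": {"Hangul"},
}

# Most likely language per script. Han is left out on purpose: the thread
# notes that no single language is selected for Han items.
LIKELY_LANG_FOR_SCRIPT = {
    "Thai": "th",
    "Hiragana": "ja",
    "Katakana": "ja",
    "Hangul": "ko",
}


def resolve_language(
    script: str,
    explicit_lang: Optional[str] = None,
    default_langs: Sequence[str] = (),
) -> Optional[str]:
    """Pick a language for a run of text in `script`, or None if unknown."""
    # 1. A language set on the item wins, if it can be written in this script.
    if explicit_lang is not None and script in SCRIPTS_FOR_LANG.get(explicit_lang, ()):
        return explicit_lang
    # 2. Otherwise, the first default language that can use this script.
    for lang in default_langs:
        if script in SCRIPTS_FOR_LANG.get(lang, ()):
            return lang
    # 3. Otherwise, the predominant language for the script, if there is one.
    # 4. Failing that, assign no language (None).
    return LIKELY_LANG_FOR_SCRIPT.get(script)
```

With these sample tables, `resolve_language("Thai")` resolves to "th", while `resolve_language("Thai", explicit_lang="kxm")` resolves to "kxm": untagged Thai-script text gets the common default, and tagged minority-language text is left alone.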
On Thu, Jan 9, 2014 at 12:01 PM, Martin Hosken <[email protected]> wrote:
> Dear All,
>
> > > https://github.com/arielm/Unicode/blob/master/Projects/ScriptDetector
> >
> > This is awesome! Thank you.
>
> As I work with minority languages, automatic language detectors make me
> shudder and cry. Please do not assume that because something is in, say
> Thai script, that it is in Thai language. This is true for nearly every
> script there is.
>
> Yours,
> Martin
>
> >
> > behdad
> >
> > > Feedback is welcome,
> > > Ariel
> > >
> > > P.S. the next step is to mix script/lang items with BIDI items (the Mapnik
> > > project should be very helpful here...)
> > >
> > >
> > > On Mon, Dec 23, 2013 at 4:46 AM, Behdad Esfahbod <[email protected]
> > > <mailto:[email protected]>> wrote:
> > >
> > > On 13-12-22 08:51 PM, Ariel Malka wrote:
> > > > Thanks Behdad, the info on how it works in Pango is indeed super useful.
> > > >
> > > > An attempt to recap using my original Japanese example:
> > > >
> > > > ユニコードは、すべての文字に固有の番号を付与します
> > > >
> > > > ICU's "scrptrun" is detecting Katakana, Hiragana and Han scripts.
> > > >
> > > > Case 1: no "input list of languages" is provided.
> > > >
> > > > a) For Katakana and Hiragana items, "ja" will be selected, with the help
> > > > of http://goo.gl/mpD9Fg
> > > > In turn, MTLmr3m.ttf (default for "ja" in my system) will be used.
> > >
> > > So far so good.
> > >
> > > > b) For Han items, no language will be selected because of
> > > > http://goo.gl/xusqwn
> > > > At this stage, we still need to pick a font, so I guess we
> > > > choose DroidSansFallback.ttf (default for Han in my system), unless...
> > > >
> > > > Some additional strategy could be used, like: observing the surrounding
> > > > items?
> > >
> > > Yes. All itemization issues can use surrounding context when in doubt...
> > > It's just about managing complexity...
> > >
> > > > Case 2: we use "ja" (say, collected from the locale) as "input language"
> > > >
> > > > For all the items, "ja" will be selected because the 3 scripts are valid for
> > > > writing this language, as defined in http://goo.gl/hwQri5
> > > >
> > > > By the way, I wonder why Korean is not including Han
> > > > (see http://goo.gl/bI5BLj), in contradiction to the explanations
> > > > in http://goo.gl/xusqwn?
> > >
> > > Great point. The way the script-per-language was put together is using
> > > fontconfig's orth files, which basically only list Hangul characters for
> > > Korean. It definitely can be improved upon and I'm willing to hear from
> > > roozbeh and others whether we have better data somewhere.
> > >
> > > behdad
> > >
> > > >
> > > > On Mon, Dec 23, 2013 at 1:35 AM, Behdad Esfahbod <[email protected]
> > > > <mailto:[email protected]>
> > > > <mailto:[email protected] <mailto:[email protected]>>> wrote:
> > > >
> > > > On 13-12-22 06:17 PM, Ariel Malka wrote:
> > > > >> As it happens, those three scripts are all considered "simple", so the shaping
> > > > >> logic in HarfBuzz is the same for all three.
> > > > >
> > > > > Good to know. For the record, there's a function for checking if a script is
> > > > > complex in the recent Harfbuzz-flavored Android OS: http://goo.gl/KL1KUi
> > > >
> > > > Please NEVER use something like that. It's broken by design. It exists in
> > > > Android for legacy reasons, and will eventually be removed.
> > > >
> > > > >> Where it does make a difference
> > > > >> is if the font has ligatures, kerning, etc for those. OpenType organizes
> > > > >> those features by script, and if you request the wrong script you will miss
> > > > >> out on the features.
> > > > >
> > > > > Makes sense to me for Hebrew, Arabic, Thai, etc., but I was bit surprised to
> > > > > find-out that LATN was also a complex script.
> > > >
> > > > LATN uses the "generic" shaper, so it's not complex, no.
> > > >
> > > > > So for instance, if I would shape some text containing Hebrew and English
> > > > > solely using the HEBR script, I would probably loose kerning and ffi-like
> > > > > ligatures for the english part
> > > >
> > > > Correct.
> > > >
> > > > > (this is what I'm actually doing currently in
> > > > > my "simple" BIDI implementation...)
> > > >
> > > > Then fix it. BIDI and script itemization are two separate issues.
> > > >
> > > > >> How you do font selection and what script you pass to HarfBuzz are two
> > > > >> completely separate issues. Font fallback stack should be per-language.
> > > > >
> > > > > I understand that the best scenario will always be to take decisions based on
> > > > > "language" rather than solely on "script", but it creates a problem:
> > > > >
> > > > > Say you work on an API for Unicode text rendering: you can't promise your
> > > > > users a solution where they would use arbitrary text without providing
> > > > > language-context per span.
> > > >
> > > > These are very good questions. And we have answers to all. Unfortunately
> > > > there's no single location with all this information. I'm working on
> > > > documenting them, but looks like replying to you and letting you document is
> > > > better.
> > > >
> > > > What Pango does is: it takes an input list of languages (through $LANGUAGE for
> > > > example), and whenever there's a item of text with script X, it assigns a
> > > > language to the item in this manner:
> > > >
> > > > - If a language L is set on the item (through xml:lang, or whatever else the
> > > > user can use to set a language), and script X may be used to write language L,
> > > > then resolve to language L and return,
> > > >
> > > > - for each language L in the list of default languages $LANGUAGE, if script
> > > > X may be used to write language L, then resolve to language L and return,
> > > >
> > > > - If there's a predominant language L that is likely for script X, resolve
> > > > to language L and return,
> > > >
> > > > - Assign no language.
> > > >
> > > > This algorithm needs two tables of data:
> > > >
> > > > - List of scripts a language tag may possibly use. This is for example
> > > > available in pango-script-lang-table.h. It's generated from fontconfig orth
> > > > files using pango/tools/gen-script-for-lang.c. Feel free to copy it.
> > > >
> > > > - List of most likely language for each script. This is available in CLDR:
> > > >
> > > > http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/likely_subtags.html
> > > >
> > > > Pango has it's own manually compiled list in pango-language.c
> > > >
> > > > Again, all these are on my plate for the next library I'm going to design. It
> > > > will take a while though...
> > > >
> > > > behdad
> > > >
> > > > > Or, to come back to the origin of the message: solutions like ICU's "scrptrun"
> > > > > which are doing script detection are not appropriate (because they won't help
> > > > > you finding the right font due to the lack of language context...)
> > > > >
> > > > > I guess the problem is even more generic, like with utf8-encoded html pages
> > > > > rendered in modern browsers, as demonstrated by the creator of liblinebreak:
> > > > > http://wyw.dcweb.cn/lang_utf8.htm
> > > > >
> > > > > On Sun, Dec 22, 2013 at 10:47 PM, Behdad Esfahbod <[email protected]
> > > > > <mailto:[email protected]> <mailto:[email protected] <mailto:[email protected]>>
> > > > > <mailto:[email protected] <mailto:[email protected]>
> > > > > <mailto:[email protected] <mailto:[email protected]>>>> wrote:
> > > > >
> > > > > On 13-12-22 10:10 AM, Ariel Malka wrote:
> > > > > > I'm trying to render "regular" (i.e. modern, horizontal) Japanese with
> > > > > > Harfbuzz.
> > > > > >
> > > > > > So far, I have been using HB_SCRIPT_KATAKANA and it looks similar to what is
> > > > > > rendered via browsers.
> > > > > >
> > > > > > But after examining other rendering solutions I can see that "automatic script
> > > > > > detection" can often take place.
> > > > > >
> > > > > > For instance, the Mapnik project is using ICU's "scrptrun", which, given the
> > > > > > following sentence:
> > > > > >
> > > > > > ユニコードは、すべての文字に固有の番号を付与します
> > > > > >
> > > > > > would detect a mix of Katakana, Hiragana and Han scripts.
> > > > > >
> > > > > > But for instance, it would not change anything if I'd render the sentence by
> > > > > > mixing the 3 different scripts (i.e. instead of using only HB_SCRIPT_KATAKANA.)
> > > > > >
> > > > > > Or are there situations where it would make a difference?
> > > > >
> > > > > As it happens, those three scripts are all considered "simple", so the shaping
> > > > > logic in HarfBuzz is the same for all three. Where it does make a difference
> > > > > is if the font has ligatures, kerning, etc for those. OpenType organizes
> > > > > those features by script, and if you request the wrong script you will miss
> > > > > out on the features.
> > > > >
> > > > > > I'm asking that because I suspect a catch-22 situation here. For example, the
> > > > > > word "diameter" in Japanese is 直径 which, given to "scrptrun" would be
> > > > > > detected as Han script.
> > > > > >
> > > > > > As far as I understand, it could be a problem on systems where
> > > > > > DroidSansFallback.ttf is used, because the word would look like in Simplified
> > > > > > Chinese.
> > > > > >
> > > > > > Now, if we were using MTLmr3m.ttf, which is preferred for Japanese, the word
> > > > > > would have been rendered as intended.
> > > > >
> > > > > How you do font selection and what script you pass to HarfBuzz are two
> > > > > completely separate issues. Font fallback stack should be per-language.
> > > > >
> > > > > > Reference: https://code.google.com/p/chromium/issues/detail?id=183830
> > > > > >
> > > > > > Any feedback would be appreciated. Note that the wisdom accumulated here will
> > > > > > be translated into tangible info and code samples (see
> > > > > > https://github.com/arielm/Unicode)
> > > > > >
> > > > > > Thanks!
> > > > > > Ariel
> > > > > >
> > > > > > _______________________________________________
> > > > > > HarfBuzz mailing list
> > > > > > [email protected] <mailto:[email protected]>
> > > > > > <mailto:[email protected] <mailto:[email protected]>>
> > > > > > <mailto:[email protected] <mailto:[email protected]>
> > > > > > <mailto:[email protected] <mailto:[email protected]>>>
> > > > > > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> > > > >
> > > > > --
> > > > > behdad
> > > > > http://behdad.org/
> > > >
> > > > --
> > > > behdad
> > > > http://behdad.org/
> > >
> > > --
> > > behdad
> > > http://behdad.org/
> >
> > --
> > behdad
> > http://behdad.org/
> > _______________________________________________
> > HarfBuzz mailing list
> > [email protected]
> > http://lists.freedesktop.org/mailman/listinfo/harfbuzz
> _______________________________________________
> HarfBuzz mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
>
