Please ignore my previous mail. The latest version of Uniscribe does not work that way.
I was using a rather outdated version of Uniscribe until yesterday. Today I
at last had a chance to use a Windows 8 machine for a while. In short,
Uniscribe in Windows 8 follows KS X 1026-1 completely and nothing more. The
Unicode spec has been thrown away. Personally I don't like it. The
reordering of Hangul tone marks was especially remarkable.

Attached is a sample Hangul text file. Some lines are well-formed; others
contain malformed text.

2013/4/12 Behdad Esfahbod <[email protected]>:
> Ok, I'm more confused now :). I'll find some time to put something together
> and take it from there. In the mean time, if you can compile a list of
> sequences that would test all the corner cases you can think of, that would
> immensely help with the implementation.
>
> Thanks,
> b
>
> On 13-04-10 02:45 AM, Dohyun Kim wrote:
>> 2013/4/10 Dohyun Kim <[email protected]>:
>>> 2013/4/10 Behdad Esfahbod <[email protected]>:
>>>> Hi,
>>>>
>>>> Ok, what you describe sounds very close to the OpenType spec:
>>>>
>>>> http://www.microsoft.com/typography/otfntdev/hangulot/
>>>>
>>>> and what the ICU Layout Hangul shaper does.
>>>>
>>>> The one part I don't understand is the section "Compose Old Hangul Jamo
>>>> combinations" under:
>>>>
>>>> http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>>>
>>>> I can't make sense of that part, since Appendix B does not list what the
>>>> jamos compose to.
>>>>
>>>> Please review those documents and share any insights you may have. I'll go
>>>> ahead with implementing a shaper then.
>>>>
>>>
>>> This Hangul OpenType spec from Microsoft is quite outdated. It was
>>> written in 2003, ten years ago from now. In the meantime, KS X 1026-1
>>> and Unicode 5.2 have been released in 2007 and 2009 respectively.
>>> Unicode 5.2 has assigned code points to a number of new jamos, which
>>> are U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
>>> U+D7B0..U+D7C6, and U+D7CB..U+D7FB.
>>> Consequently, those items in
>>> Appendix B that you pointed out now all have their own Unicode code
>>> points. For instance, <U+1102 U+1109> has now become <U+115B>.
>>> Before Unicode 5.2, Koreans could not help writing down <U+1102
>>> U+1109> to represent the composite jamo which is composed of Choseong
>>> Nieun and Choseong Sios. Now it is a story of the past. Anyway, you can
>>> find the full list of composite jamos with the elements composing them
>>> at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map which I have
>>> shown before as a reference.
>>>
>>> Moreover, the Microsoft spec has incorrect information on several
>>> points. The section "Compose Old Hangul Jamo combinations" is one of
>>> them. This kind of jamo composition could not be done at the pre-OTLS
>>> stage before Unicode 5.2 was introduced, as there were no code points
>>> for composed jamos at that time. Jamo-to-jamo composition could be
>>> done only at the stage of applying the "ccmp" font feature. Now we have
>>> all composite jamos registered in Unicode, so a shaping engine can do
>>> this composition before applying font features. However, this kind of
>>> composition is contrary to the spec of KS X 1026-1. Section 5.3 of
>>> this spec says that "two or more code positions of simple letters
>>> cannot be concatenated to represent a single complex letter."
>>> Certainly, this concatenation is allowed according to the Unicode
>>> standard, though not recommended since the release of version 5.2.
>>> Yes, we have just encountered another discrepancy between local and
>>> global standards. But, in our practice, Koreans do not input
>>> decomposed jamos to represent a single composite jamo any more. Above
>>> all, it turned out from my experiment on a Windows machine that a recent
>>> version of Uniscribe does not compose jamo elements to a composite
>>> jamo, even for those jamos which were not available before Unicode
>>> 5.2.
>>> So I think it is better for us to ignore the section "Compose
>>> Old Hangul Jamo combinations" and its Appendix B altogether.
>>>
>>> Instead, Uniscribe sets boundaries between syllable blocks as I
>>> mentioned before. As we know that all the single and composite jamos
>>> have their own code points, the rule to identify syllable blocks is
>>> quite simple now:
>>> L V T? M?
>>
>> Today I have tested Uniscribe again. It turned out that Uniscribe
>> does not simply apply this rule to identify syllable blocks. When a
>> jamo sequence is a candidate to be composed to a composite jamo newly
>> added in Unicode 5.2, Uniscribe considers it as a single jamo, though
>> it does *not* actually compose the sequence to the composite jamo. As
>> this may be a little confusing, let us take some examples. For each
>> input text on the left side, Uniscribe sets boundaries as on the right side:
>>
>> <U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>
>>
>> <U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
>> However, Uniscribe divides them into two syllable blocks, because
>> U+1101 has been registered in Unicode from its very early versions.
>>
>> <U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>
>>
>> <U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly registered
>> in Unicode version 5.2. In this case Uniscribe considers them as
>> a single composite jamo and so does not set a boundary between U+1103
>> and U+1106. Notice that Uniscribe does not actually compose these
>> element jamos to U+A960, just allowing font features to do their job.
>>
>> <U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>
>>
>> In a similar fashion, as <U+11AB U+11AB> can be concatenated to
>> <U+11FF> which is a newly added jamo, Uniscribe does not divide
>> syllable blocks in-between.
>>
>> This policy of Uniscribe seems to be a little complicated.
>> But it
>> must be quite reasonable as it also supports old documents which had
>> been written before Unicode 5.2 was introduced, ensuring backward
>> compatibility.
>>
>>
>>> where L is a leading consonant including Choseong filler; V is a medial
>>> vowel including Jungseong filler; T is a trailing consonant; M is
>>> a Hangul Tone Mark (U+302E U+302F); and ? means zero or one occurrence
>>> of the specified character. Before or after these jamo sequences,
>>> uniscribe seems to set boundaries. And what is important is that
>>> Uniscribe composes jamos to a syllable only when the complete sequence of <L
>>> V T?> matches a precomposed Hangul syllable. In other words, <L V OT>
>>> is not composed and Uniscribe passes the sequence intact to the OTLS
>>> process.
>>>
>>> Thanks a lot for your effort to support Hangul.
>>> Best,
>>>
>>>>
>>>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>>>> 2013/4/6 Behdad Esfahbod <[email protected]>:
>>>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>>>> 2013/4/5 Dohyun Kim <[email protected]>:
>>>>>>>> Sorry for the noise.
>>>>>>>> I have booted on a Windows machine and tested uniscribe a bit. My guess
>>>>>>>> on how uniscribe works on Hangul is:
>>>>>>>>
>>>>>>>> 1. decompose hangul syllables to jamos
>>>>>>>>
>>>>>>>> 2. compose single jamos to composite jamos as far as possible
>>>>>>>> eg., U+1100 U+1100 => U+1101
>>>>>>>> Note: mapping table for this composition is available at
>>>>>>>> ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>>>
>>>>>>>
>>>>>>> Well, after a bit more testing, it turned out that this second process is
>>>>>>> not what uniscribe does. Sorry for my wrong information. I had
>>>>>>> guessed this on the basis of the old unicode standard. Recently unicode
>>>>>>> also does not recommend using multiple single jamos to get a composite
>>>>>>> jamo.
>>>>>>>
>>>>>>> Instead, uniscribe inserts fillers (U+115F U+1160) around a single
>>>>>>> lonely jamo which does not make up a syllable block.
>>>>>>
>>>>>> Interesting.
>>>>>> So, for a lone T jamo, both 115F and 1160 are inserted?
>>>>>
>>>>> Yes, when fillers are inserted. But actually uniscribe does not seem
>>>>> to insert fillers. Sorry for my immature conclusion. Today I have
>>>>> downloaded the harfbuzz win32 binary and tested some jamo texts using
>>>>> hb-shape. This utility gave me more accurate information than I could
>>>>> obtain with the naked eye. Contrary to my expectation, the output of
>>>>> hb-shape did not have any traces of fillers. So, it seems evident
>>>>> that uniscribe does not insert fillers. And it seems also evident
>>>>> that uniscribe sets boundaries between syllable blocks, so that
>>>>> multiple single jamos cannot be concatenated to a composite jamo.
>>>>>
>>>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>. I guess what
>>>>> uniscribe does:
>>>>>
>>>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>>>
>>>>> 2. demarcate each syllable block by setting boundaries in-between: we
>>>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
>>>>> Probably this is related to the so-called "cluster." Yesterday I
>>>>> misconceived this boundary (maybe ZWNJ but I am not sure) as a filler.
>>>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>>>> U+1101, so the result will be a single syllable block <U+1101 U+1161
>>>>> U+11F0>. Nowadays we do not need this jamo-to-jamo composition,
>>>>> because all the jamos known until today have been registered since
>>>>> unicode version 5.2.
>>>>>
>>>>> 3. try to re-compose jamos to a syllable letter. But as our sample
>>>>> text matches the case of <L V OT>, nothing is converted.
>>>>>
>>>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>>>> where ".s" means substituted glyph.
>>>>>
>>>>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>>>>> practice.
>>>>> However, there remains some possibility that some
>>>>> applications or libraries do pass to harfbuzz some unicode-normalized
>>>>> text, in which case harfbuzz would give us an incorrect result. So I
>>>>> changed my mind, and now I suggest an implementation of a hangul shaper.
>>>>> It is not an urgent task, though; harfbuzz works quite well already.
>>>>> However, we want harfbuzz to be as perfect as possible.
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>>>>> 3. compose jamos to hangul syllables as far as possible
>>>>>>>> Note: this process complies with KS X 1026-1. In other words, jamo
>>>>>>>> sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>>>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means Hangul
>>>>>>>> syllable equivalent to L V.
>>>>>>>>
>>>>>>>> 4. apply opentype layout features
>>>>>>>>
>>>>>>>> It is somewhat complicated but gives a perfect result. It satisfies
>>>>>>>> both the Korean and Unicode standards. Nevertheless, what current
>>>>>>>> harfbuzz does is quite excellent as well and I am satisfied with it. I
>>>>>>>> am reporting just for reference.
>>>>>>>>
>>>>>
>>>>
>>>> --
>>>> behdad
>>>> http://behdad.org/
>>>
>>>
>>>
>>> --
>>> Dohyun Kim
>>> College of Law, Dongguk University
>>> Seoul, Republic of Korea
>>
>>
>>
>> --
>> Dohyun Kim
>> College of Law, Dongguk University
>> Seoul, Republic of Korea
>>
>
> --
> behdad
> http://behdad.org/

--
Dohyun Kim
College of Law, Dongguk University
Seoul, Republic of Korea
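
P.S. For reference, the composition rule quoted above (compose jamos only when the complete <L V T?> sequence matches a precomposed syllable, and leave <L V OT> intact) could be sketched roughly as below. This is only an illustration of the rule as stated in this thread, not what Uniscribe or harfbuzz actually runs. The function name compose_block is made up for the example, the input is assumed to be one already-demarcated syllable block, and leaving a block untouched when it has more than one trailing consonant is my own conservative assumption.

```python
# Constants of the Hangul syllable composition arithmetic in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def is_modern_l(cp):
    return L_BASE <= cp < L_BASE + L_COUNT      # U+1100..U+1112


def is_modern_v(cp):
    return V_BASE <= cp < V_BASE + V_COUNT      # U+1161..U+1175


def is_modern_t(cp):
    return T_BASE < cp < T_BASE + T_COUNT       # U+11A8..U+11C2


def is_any_t(cp):
    # Modern and old trailing consonants, including Hangul Jamo Extended-B.
    return 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB


def compose_block(block):
    """Compose one syllable block (a list of code points), hypothetical helper.

    The block is composed only if the whole <L V T?> part has a precomposed
    syllable; otherwise it is returned unchanged.
    """
    block = list(block)
    if len(block) < 2 or not (is_modern_l(block[0]) and is_modern_v(block[1])):
        return block
    rest = block[2:]
    t_index = 0
    if rest and is_any_t(rest[0]):
        if not is_modern_t(rest[0]):
            return block                        # <L V OT>: leave intact
        t_index = rest[0] - T_BASE
        rest = rest[1:]
        if rest and is_any_t(rest[0]):
            return block                        # assumption: several T, leave intact
    l_index = block[0] - L_BASE
    v_index = block[1] - V_BASE
    syllable = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index
    return [syllable] + rest                    # tone mark, if any, stays after it
```

With this sketch <U+1100 U+1161> becomes <U+AC00>, while <U+1100 U+1161 U+11F0> is passed through unchanged for the font features to handle.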
듀ᇰ귁〮 듀ᇰ귁〮 ᄇᆡᆨ셔ᇰ ᄇᆞᅵᆨ셔ᇰ ᄃᅠᄅᆞᆯ ᅞᆞᆯ ᄃᄅᆞᆯ ᅟᆞᇰᅟᅠᇰ ᅟᆞᇰᇰ 쓔ᇙ〯ᄍᆞᆼ 쓔ᇙ〯ᄍᆞᆼ ꥪퟁ ᄅᄇ비ᅩᅵ ꥻᅵᆫ ᄒ신
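
To see which code points a sample line like the ones above really contains (the naked eye is not reliable here), a small script along the following lines may help. It is only a sketch: the names jamo_class and describe are made up for this example, and the ranges are the Hangul Jamo, Extended-A and Extended-B blocks as of Unicode 5.2. It does not judge whether a line is well-formed; it only labels each character as L, V, T, M (tone mark), LV or LVT (precomposed syllable), or X (anything else).

```python
def jamo_class(cp):
    """Return a coarse class for one code point (hypothetical helper)."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"      # leading consonants, including Choseong filler U+115F
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"      # medial vowels, including Jungseong filler U+1160
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"      # trailing consonants, modern and old
    if cp in (0x302E, 0x302F):
        return "M"      # Hangul tone marks
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllable: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "X"


def describe(text):
    """List 'U+XXXX:class' for every non-space character of text."""
    return ["U+%04X:%s" % (ord(ch), jamo_class(ord(ch)))
            for ch in text if not ch.isspace()]


if __name__ == "__main__":
    import sys
    for line in sys.stdin:
        print(" ".join(describe(line)))
```

Feeding the attached file to the script on standard input prints one line of labels per input line, which makes a reordered tone mark or a decomposed composite jamo easy to spot.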
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
