2013/4/10 Dohyun Kim <[email protected]>:
> 2013/4/10 Behdad Esfahbod <[email protected]>:
>> Hi,
>>
>> Ok, what you describe sounds very close to the OpenType spec:
>>
>>   http://www.microsoft.com/typography/otfntdev/hangulot/
>>
>> and what the ICU Layout Hangul shaper does.
>>
>> The one part I don't understand is the section "Compose Old Hangul Jamo
>> combinations" under:
>>
>>   http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>
>> I can't make sense of that part, since Appendix B does not list what the
>> jamos compose to.
>>
>> Please review those documents and share any insights you may have.  I'll go
>> ahead with implementing a shaper then.
>>
>
> This Hangul OpenType spec from Microsoft is quite outdated. It was
> written in 2003, ten years ago. In the meantime, KS X 1026-1
> and Unicode 5.2 were released, in 2007 and 2009 respectively.
> Unicode 5.2 assigned code points to a number of new jamos:
> U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
> U+D7B0..U+D7C6, and U+D7CB..U+D7FB. Consequently, the items in
> Appendix B that you pointed out now all have their own Unicode code
> points. For instance, <U+1102 U+1109> has now become <U+115B>.
> Before Unicode 5.2, Koreans had no choice but to write <U+1102
> U+1109> to represent the composite jamo composed of Choseong
> Nieun and Choseong Sios. That is now a thing of the past. Anyway, you can
> find the full list of composite jamos, with the elements composing them,
> at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map, which I
> showed before as a reference.
>
> Moreover, the Microsoft spec has incorrect information on several
> points. The section "Compose Old Hangul Jamo combinations" is one of
> them. This kind of jamo composition could not be done at the pre-OTLS
> stage before Unicode 5.2 was introduced, as there were no code points
> for composed jamos at that time.
> Jamo-to-jamo composition could be
> done only at the stage of applying the "ccmp" font feature. Now that
> all composite jamos are registered in Unicode, a shaping engine can do
> this composition before applying font features. However, this kind of
> composition is contrary to KS X 1026-1. Section 5.3 of
> that spec says that "two or more code positions of simple letters
> cannot be concatenated to represent a single complex letter."
> Certainly, this concatenation is allowed by the Unicode
> standard, though it has not been recommended since the release of
> version 5.2. Yes, we have just encountered another discrepancy between
> local and global standards. But in practice, Koreans no longer input
> decomposed jamos to represent a single composite jamo. Above
> all, my experiment on a Windows machine showed that a recent
> version of Uniscribe does not compose jamo elements into a composite
> jamo, even for those jamos which were not available before Unicode
> 5.2. So I think it is better for us to ignore the section "Compose
> Old Hangul Jamo combinations" and its Appendix B altogether.
>
> Instead, Uniscribe sets boundaries between syllable blocks as I
> mentioned before. Since all the single and composite jamos now
> have their own code points, the rule to identify syllable blocks is
> quite simple:
>     L V T? M?
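As a rough illustration, the plain rule quoted above could be sketched in
Python as follows. This is only a sketch, not code from Uniscribe or
HarfBuzz: the names SYLLABLE_BLOCK and split_blocks are made up for the
example, the input is assumed to be already decomposed to jamos, and the
character classes are the conjoining jamo ranges as of Unicode 5.2.

```python
import re

# Conjoining jamo classes as of Unicode 5.2 (fillers included, as in the rule).
L = "\u1100-\u115F\uA960-\uA97C"  # leading consonants, incl. Choseong filler U+115F
V = "\u1160-\u11A7\uD7B0-\uD7C6"  # medial vowels, incl. Jungseong filler U+1160
T = "\u11A8-\u11FF\uD7CB-\uD7FB"  # trailing consonants
M = "\u302E\u302F"                # Hangul tone marks

# One syllable block: L V T? M?
SYLLABLE_BLOCK = re.compile(f"[{L}][{V}][{T}]?[{M}]?")


def split_blocks(text):
    """Split already-decomposed text into syllable blocks.

    Each match of L V T? M? becomes one block; any character that does
    not start such a match (a lone jamo, a Latin letter, ...) stands alone.
    """
    blocks = []
    pos = 0
    while pos < len(text):
        match = SYLLABLE_BLOCK.match(text, pos)
        end = match.end() if match else pos + 1
        blocks.append(text[pos:end])
        pos = end
    return blocks
```

With this sketch, split_blocks("\u1100\u1100\u1161") returns two blocks,
<U+1100> and <U+1100 U+1161>, because the first U+1100 is not followed by
a vowel.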
Today I have tested Uniscribe again. It turned out that Uniscribe
does not simply apply this rule to identify syllable blocks. When a
jamo sequence is a candidate for composition into a composite jamo newly
added in Unicode 5.2, Uniscribe treats it as a single jamo, though it
does *not* actually compose the sequence into the composite jamo. As this
may be a little confusing, let us take some examples. For each input
text on the left, Uniscribe sets boundaries as shown on the right:

    <U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>

<U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
However, Uniscribe divides it into two syllable blocks, because U+1101
has been in Unicode since its very early versions.

    <U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>

<U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly registered
in Unicode 5.2. In this case Uniscribe treats them as a single
composite jamo and so does not set a boundary between U+1103 and U+1106.
Notice that Uniscribe does not actually compose these element jamos into
U+A960; it just lets the font features do their job.

    <U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>

In a similar fashion, as <U+11AB U+11AB> can be concatenated to <U+11FF>,
which is a newly added jamo, Uniscribe does not divide the syllable
block in between.

This policy of Uniscribe seems a little complicated. But it must be
quite reasonable, as it also supports old documents written before
Unicode 5.2 was introduced, ensuring backward compatibility.

> where L is a leading consonant, including the Choseong filler; V is a
> medial vowel, including the Jungseong filler; T is a trailing consonant;
> M is a Hangul tone mark (U+302E, U+302F); and ? means zero or one
> occurrence of the specified character. Before and after such a jamo
> sequence, Uniscribe seems to set boundaries.
> And what is important is that
> Uniscribe composes jamos into a syllable only when the complete sequence
> <L V T?> matches a precomposed Hangul syllable. In other words, <L V OT>
> is not composed, and Uniscribe passes the sequence intact to the OTLS
> process.
>
> Thanks a lot for your effort to support Hangul.
> Best,
>
>>
>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>> 2013/4/6 Behdad Esfahbod <[email protected]>:
>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>> 2013/4/5 Dohyun Kim <[email protected]>:
>>>>>> Sorry for the noise.
>>>>>> I have booted a Windows machine and tested Uniscribe a bit. My guess
>>>>>> on how Uniscribe works on Hangul is:
>>>>>>
>>>>>> 1. decompose Hangul syllables to jamos
>>>>>>
>>>>>> 2. compose single jamos into composite jamos as far as possible
>>>>>>    e.g., U+1100 U+1100 => U+1101
>>>>>>    Note: a mapping table for this composition is available at
>>>>>>    ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>
>>>>>
>>>>> Well, after a bit more testing, it turned out that this second step is
>>>>> not what Uniscribe does. Sorry for my wrong information. I had
>>>>> guessed this on the basis of the old Unicode standard. Recent Unicode
>>>>> also does not recommend using multiple single jamos to get a composite
>>>>> jamo.
>>>>>
>>>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around single
>>>>> lonely jamos which do not make up a syllable block.
>>>>
>>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>>
>>> Yes, when fillers are inserted. But actually Uniscribe does not seem
>>> to insert fillers. Sorry for my premature conclusion. Today I
>>> downloaded the HarfBuzz win32 binary and tested some jamo texts using
>>> hb-shape. This utility gave me more accurate information than I could
>>> obtain with the naked eye. Contrary to my expectation, the output of
>>> hb-shape did not have any traces of fillers. So it seems evident
>>> that Uniscribe does not insert fillers.
>>> And it seems also evident
>>> that Uniscribe sets boundaries between syllable blocks, so that
>>> multiple single jamos cannot be concatenated into a composite jamo.
>>>
>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>. My guess at what
>>> Uniscribe does:
>>>
>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>
>>> 2. demarcate each syllable block by setting boundaries in between: we
>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means a syllable boundary.
>>> Probably this is related to the so-called "cluster." Yesterday I
>>> mistook this boundary (maybe ZWNJ, but I am not sure) for a filler.
>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>> U+1101, so the result would be a single syllable block <U+1101 U+1161
>>> U+11F0>. Nowadays we do not need this jamo-to-jamo composition,
>>> because all the jamos known to date have been registered since
>>> Unicode version 5.2.
>>>
>>> 3. try to re-compose jamos into a syllable letter. But as our sample
>>> text matches the case of <L V OT>, nothing is converted.
>>>
>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>> where ".s" means a substituted glyph.
>>>
>>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>>> practice. However, there remains some possibility that some
>>> applications or libraries pass Unicode-normalized text to HarfBuzz,
>>> in which case HarfBuzz would give us an incorrect result. So I
>>> changed my mind, and now I suggest implementing a Hangul shaper.
>>> It is not an urgent task, though; HarfBuzz works quite well already.
>>> However, we want HarfBuzz to be as perfect as possible.
>>>
>>> Regards,
>>>
>>>
>>>>>> 3. compose jamos into Hangul syllables as far as possible
>>>>>> Note: this process complies with KS X 1026-1.
>>>>>> In other words, the jamo
>>>>>> sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>>> consonant (U+11C3..U+11FF, U+D7CB..U+D7FB), and LV means the Hangul
>>>>>> syllable equivalent to L V.
>>>>>>
>>>>>> 4. apply OpenType layout features
>>>>>>
>>>>>> It is somewhat complicated but gives a perfect result. It satisfies
>>>>>> both the Korean and Unicode standards. Nevertheless, what current
>>>>>> HarfBuzz does is quite excellent as well and I am satisfied with it. I
>>>>>> am reporting just for reference.
>>>>>>
>>>
>>
>> --
>> behdad
>> http://behdad.org/
>
>
>
> --
> Dohyun Kim
> College of Law, Dongguk University
> Seoul, Republic of Korea

--
Dohyun Kim
College of Law, Dongguk University
Seoul, Republic of Korea
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
