Ok, I'm more confused now :). I'll find some time to put something together and take it from there. In the meantime, if you can compile a list of sequences that would test all the corner cases you can think of, that would immensely help with the implementation.
Thanks,
b

On 13-04-10 02:45 AM, Dohyun Kim wrote:
> 2013/4/10 Dohyun Kim <[email protected]>:
>> 2013/4/10 Behdad Esfahbod <[email protected]>:
>>> Hi,
>>>
>>> Ok, what you describe sounds very close to the OpenType spec:
>>>
>>> http://www.microsoft.com/typography/otfntdev/hangulot/
>>>
>>> and what the ICU Layout Hangul shaper does.
>>>
>>> The one part I don't understand is the section "Compose Old Hangul Jamo
>>> combinations" under:
>>>
>>> http://www.microsoft.com/typography/otfntdev/hangulot/shaping.htm
>>>
>>> I can't make sense of that part, since Appendix B does not list what the
>>> jamos compose to.
>>>
>>> Please review those documents and share any insights you may have.  I'll go
>>> ahead with implementing a shaper then.
>>>
>>
>> This Hangul OpenType spec from Microsoft is quite outdated.  It was
>> written in 2003, ten years ago.  In the meantime, KS X 1026-1
>> and Unicode 5.2 were released in 2007 and 2009 respectively.
>> Unicode 5.2 has assigned code points to a number of new jamos, which
>> are U+115A..U+115E, U+11A3..U+11A7, U+11FA..U+11FF, U+A960..U+A97C,
>> U+D7B0..U+D7C6, and U+D7CB..U+D7FB.  Consequently, those items in
>> Appendix B that you pointed out now all have their own Unicode code
>> points.  For instance, <U+1102 U+1109> has now become <U+115B>.
>> Before Unicode 5.2, Koreans had no choice but to write <U+1102
>> U+1109> to represent the composite jamo which is composed of Choseong
>> Nieun and Choseong Sios.  That is now a thing of the past.  Anyway, you
>> can find the full list of composite jamos with their component elements
>> at ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map which I have
>> shown before as a reference.
>>
>> Moreover, the Microsoft spec has incorrect information on several
>> points.  The section "Compose Old Hangul Jamo combinations" is one of
>> them.  This kind of jamo composition could not be done at the pre-OTLS
>> stage before Unicode 5.2 was introduced, as there were no code points
>> for composed jamos at that time.  Jamo-to-jamo composition could be
>> done only at the stage of applying the "ccmp" font feature.  Now we have
>> all composite jamos registered in Unicode, so a shaping engine can do
>> this composition before applying font features.  However, this kind of
>> composition is contrary to the spec of KS X 1026-1.  Section 5.3 of
>> this spec says that "two or more code positions of simple letters
>> cannot be concatenated to represent a single complex letter."
>> Certainly, this concatenation is allowed according to the Unicode
>> standard, though not recommended since the release of version 5.2.
>> Yes, we have just encountered another discrepancy between local and
>> global standards.  But, in practice, Koreans do not input
>> decomposed jamos to represent a single composite jamo any more.  Above
>> all, it turned out from my experiment on a Windows machine that a recent
>> version of Uniscribe does not compose jamo elements into a composite
>> jamo, even for those jamos which were not available before Unicode
>> 5.2.  So I think it is better for us to ignore the section "Compose
>> Old Hangul Jamo combinations" and its Appendix B altogether.
>>
>> Instead, Uniscribe sets boundaries between syllable blocks as I
>> mentioned before.  As we know that all the single and composite jamos
>> have their own code points, the rule to identify syllable blocks is
>> quite simple now:
>> L V T? M?
>
> Today I have tested Uniscribe again.  It turned out that Uniscribe
> does not simply apply this rule to identify syllable blocks.  When a
> jamo sequence is a candidate to be composed into a composite jamo newly
> added in Unicode 5.2, Uniscribe considers it a single jamo, though
> it does *not* actually compose the sequence into the composite jamo.  As
> this may be a little confusing, let us take some examples.  For each
> input text on the left side, Uniscribe sets boundaries as on the right side:
>
> <U+1100 U+1100 U+1161> => <U+1100 | U+1100 U+1161> => <U+1100 | U+AC00>
>
> <U+1100 U+1100> is a sequence which can be concatenated to <U+1101>.
> However, Uniscribe divides them into two syllable blocks, because
> U+1101 has been registered in Unicode from its very early versions.
>
> <U+1103 U+1106 U+1161> => <U+1103 U+1106 U+1161>
>
> <U+1103 U+1106> can be concatenated to <U+A960>, a jamo newly
> registered in Unicode version 5.2.  In this case Uniscribe considers
> them a single composite jamo and so does not set a boundary between
> U+1103 and U+1106.  Notice that Uniscribe does not actually compose these
> element jamos into U+A960, just allowing font features to do their job.
>
> <U+1100 U+1161 U+11AB U+11AB> => <U+1100 U+1161 U+11AB U+11AB>
>
> In a similar fashion, as <U+11AB U+11AB> can be concatenated to
> <U+11FF>, which is a newly added jamo, Uniscribe does not divide
> syllable blocks in between.
>
> This policy of Uniscribe seems to be a little complicated.  But it
> must be quite reasonable, as it also supports old documents which had
> been written before Unicode 5.2 was introduced, ensuring backward
> compatibility.
>
>
>> where L is leading consonants including Choseong filler; V is medial
>> vowel including Jungseong filler; T is trailing consonants; M is
>> Hangul Tone Marks (U+302E U+302F); and ? means zero or one occurrence
>> of the specified character.  Before or after these jamo sequences,
>> Uniscribe seems to set boundaries.  And what is important is that
>> Uniscribe composes jamos to a syllable only when the complete sequence
>> of <L V T?> matches a precomposed Hangul syllable.  In other words,
>> <L V OT> is not composed and Uniscribe passes the sequence intact to
>> the OTLS process.
>>
>> Thanks a lot for your effort to support Hangul.
>> Best,
>>
>>>
>>> On 13-04-06 01:32 PM, Dohyun Kim wrote:
>>>> 2013/4/6 Behdad Esfahbod <[email protected]>:
>>>>> On 13-04-05 06:45 AM, Dohyun Kim wrote:
>>>>>> 2013/4/5 Dohyun Kim <[email protected]>:
>>>>>>> Sorry for the noise.
>>>>>>> I have booted a Windows machine and tested Uniscribe a bit.  My guess
>>>>>>> on how Uniscribe works on Hangul is:
>>>>>>>
>>>>>>> 1. decompose Hangul syllables to jamos
>>>>>>>
>>>>>>> 2. compose single jamos into composite jamos as far as possible
>>>>>>> e.g., U+1100 U+1100 => U+1101
>>>>>>> Note: mapping table for this composition is available at
>>>>>>> ftp://ktug.org/ktug/hcr-lvt/composejamotojamo.map
>>>>>>>
>>>>>>
>>>>>> Well, after a bit more testing, it turned out that this second process
>>>>>> is not what Uniscribe does.  Sorry for my wrong information.  I had
>>>>>> guessed this on the basis of the old Unicode standard.  Recently Unicode
>>>>>> also does not recommend using multiple single jamos to get a composite
>>>>>> jamo.
>>>>>>
>>>>>> Instead, Uniscribe inserts fillers (U+115F U+1160) around a single
>>>>>> lonely jamo which does not make up a syllable block.
>>>>>
>>>>> Interesting.  So, for a lone T jamo, both 115F and 1160 are inserted?
>>>>
>>>> Yes, when fillers are inserted.  But actually Uniscribe does not seem
>>>> to insert fillers.  Sorry for my immature conclusion.  Today I have
>>>> downloaded the harfbuzz win32 binary and tested some jamo texts using
>>>> hb-shape.  This utility gave me more accurate information than I could
>>>> obtain with the naked eye.  Contrary to my expectation, the output of
>>>> hb-shape did not have any traces of fillers.  So, it seems evident
>>>> that Uniscribe does not insert fillers.  And it seems also evident
>>>> that Uniscribe sets boundaries between syllable blocks, so that
>>>> multiple single jamos could not be concatenated to a composite jamo.
>>>>
>>>> Let us suppose an input text <U+1100 U+AC00 U+11F0>.  I guess what
>>>> Uniscribe does is:
>>>>
>>>> 1. decompose syllables to jamos: we get <U+1100 U+1100 U+1161 U+11F0>
>>>>
>>>> 2. demarcate each syllable block by setting boundaries in between: we
>>>> get <U+1100 | U+1100 U+1161 U+11F0> where | means syllable boundary.
>>>> Probably this is related to the so-called "cluster."  Yesterday I
>>>> mistook this boundary (maybe ZWNJ but I am not sure) for a filler.
>>>> BTW, according to the old standard, U+1100 U+1100 are concatenated to
>>>> U+1101, so the result will be a single syllable block <U+1101 U+1161
>>>> U+11F0>.  Nowadays we do not need this jamo-to-jamo composition,
>>>> because all the jamos known until today have been registered since
>>>> Unicode version 5.2.
>>>>
>>>> 3. try to re-compose jamos to a syllable letter.  But as our sample
>>>> text matches the case of <L V OT>, nothing is converted.
>>>>
>>>> 4. apply font features: we get <U+1100 | U+1100.s U+1161.s U+11F0.s>
>>>> where ".s" means substituted glyph.
>>>>
>>>> As I said before, we Koreans do not input text like <U+AC00 U+11F0> in
>>>> practice.  However, there remains some possibility that some
>>>> applications or libraries do pass to harfbuzz some Unicode-normalized
>>>> text, in which case harfbuzz would give us an incorrect result.  So I
>>>> changed my mind, and now I suggest an implementation of a Hangul shaper.
>>>> It is not an urgent task, though; harfbuzz works quite well already.
>>>> However, we want harfbuzz to be as perfect as possible.
>>>>
>>>> Regards,
>>>>
>>>>
>>>>>>> 3. compose jamos to Hangul syllables as far as possible
>>>>>>> Note: this process complies with KS X 1026-1.  In other words, jamo
>>>>>>> sequence <L V> in <L V OT> is *not* converted to LV, where L means
>>>>>>> leading consonant, V means medial vowel, OT means *old* trailing
>>>>>>> consonant (U+11C3..U+11FF U+D7CB..U+D7FB), and LV means Hangul
>>>>>>> syllable equivalent to L V.
>>>>>>>
>>>>>>> 4. apply OpenType layout features
>>>>>>>
>>>>>>> It is somewhat complicated but gives a perfect result.  It satisfies
>>>>>>> both the Korean and Unicode standards.  Nevertheless, what current
>>>>>>> harfbuzz does is quite excellent as well and I am satisfied with it.  I
>>>>>>> am reporting just for reference.
>>>>>>>
>>>>
>>>
>>> --
>>> behdad
>>> http://behdad.org/
>>
>>
>>
>> --
>> Dohyun Kim
>> College of Law, Dongguk University
>> Seoul, Republic of Korea
>
>
>
> --
> Dohyun Kim
> College of Law, Dongguk University
> Seoul, Republic of Korea
>

--
behdad
http://behdad.org/
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
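
For reference, the behaviour described in this thread (decompose, set syllable boundaries, recompose only complete <L V T?> blocks) could be sketched roughly as below. This is only an illustrative Python sketch under stated assumptions, not HarfBuzz or Uniscribe code. The function names and the NEW_JAMO_PAIRS table are made up for the example, and the table holds only the three pairs mentioned in the thread, not the full composejamotojamo.map. How a lone half of such a pair should be treated is not covered by the thread, so the sketch simply makes it a block of its own.

```python
# Standard Unicode Hangul syllable arithmetic constants.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT

# Hypothetical, deliberately partial table: only the pairs named in the
# thread as composable to a jamo that was added in Unicode 5.2.
NEW_JAMO_PAIRS = {
    (0x1102, 0x1109): 0x115B,
    (0x1103, 0x1106): 0xA960,
    (0x11AB, 0x11AB): 0x11FF,
}


def jamo_class(cp):
    """Classify a code point as L, V, T, M (tone mark) or X (other)."""
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if cp in (0x302E, 0x302F):
        return "M"
    return "X"


def decompose(cps):
    """Step 1: replace precomposed syllables by their conjoining jamos."""
    out = []
    for cp in cps:
        s = cp - S_BASE
        if 0 <= s < S_COUNT:
            out.append(L_BASE + s // (V_COUNT * T_COUNT))
            out.append(V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT)
            if s % T_COUNT:
                out.append(T_BASE + s % T_COUNT)
        else:
            out.append(cp)
    return out


def take_jamo(cps, i, cls):
    """Return the index after one jamo of class cls at i, or i if none.

    A pair listed in NEW_JAMO_PAIRS counts as one jamo but is left
    uncomposed, as observed for Uniscribe in the thread.
    """
    if i < len(cps) and jamo_class(cps[i]) == cls:
        if i + 1 < len(cps) and (cps[i], cps[i + 1]) in NEW_JAMO_PAIRS:
            return i + 2
        return i + 1
    return i


def segment(cps):
    """Step 2: split into blocks matching L V T? M?; anything else is alone."""
    blocks, i = [], 0
    while i < len(cps):
        j = take_jamo(cps, i, "L")
        k = take_jamo(cps, j, "V") if j > i else j
        if j > i and k > j:
            k = take_jamo(cps, k, "T")
            end = take_jamo(cps, k, "M")
        else:
            end = i + 1  # no L V prefix here: one code point, own block
        blocks.append(cps[i:end])
        i = end
    return blocks


def recompose(block):
    """Step 3: compose only when the whole block is a modern <L V T?>."""
    if len(block) in (2, 3):
        l, v = block[0] - L_BASE, block[1] - V_BASE
        t = block[2] - T_BASE if len(block) == 3 else 0
        t_ok = len(block) == 2 or 0 < t < T_COUNT
        if 0 <= l < L_COUNT and 0 <= v < V_COUNT and t_ok:
            return [S_BASE + (l * V_COUNT + v) * T_COUNT + t]
    return block
```

With this sketch, segment(decompose([0x1100, 0xAC00, 0x11F0])) gives the two blocks <U+1100> and <U+1100 U+1161 U+11F0>, and recompose leaves the second one alone because U+11F0 is an old trailing consonant. A block that carries a tone mark is also left uncomposed here, which is a simplification of the sketch and not something the thread settles.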
