+roozbeh

On 13-10-18 04:52 PM, Khaled Hosny wrote:
> On Thu, Oct 17, 2013 at 10:05:20PM +0200, Behdad Esfahbod wrote:
>> Khaled,
>>
>> Here's what Roozbeh prepared:
>>
>> ===========================
>> Behdad,
>>
>> I did a very thorough search of both the Koran and the Unicode proposals for
>> the new Arabic characters for the last fifteen years or so.
>>
>> I could actually come up with a very simple algorithm:
>>
>> First, convert the input sequence to NFD.
>>
>> The order of the characters will be a bit messed up after this due to bad old
>> decisions in Unicode, and our goal is to make it clean. After this step, we
>> will have the traditional marks (ccc not in [220, 230]) at the very beginning,
>> with the newly encoded ones (ccc in [220, 230]) after them.
>>
>> Definition: MCM, defined here, is the modifier combining marks, which actually
>> modify a base letter (and also have ccc=220 or 230). That means that
>> traditional harakat come after them in logical order, but before them in NFD.
>> Here is the MCM set:
>>
>> 0654 ARABIC HAMZA ABOVE
>> 0655 ARABIC HAMZA BELOW
>> 0658 ARABIC MARK NOON GHUNNA
>> 06DC ARABIC SMALL HIGH SEEN
>> 06E3 ARABIC SMALL LOW SEEN
>> 06E7 ARABIC SMALL HIGH YEH
>> 06E8 ARABIC SMALL HIGH NOON
>> 08F3 ARABIC SMALL HIGH WAW
>
> U+0653 ARABIC MADDAH ABOVE should be added to this list, see below.
>
>> Following, is the order in which the combining marks after each base letter
>> should be read, for them to be in logical order (it could be used for both
>> determining rendering order, and backspacing):
>>
>> 1. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=220 part of the list that are in MCM;
>> 2. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=230 part of the list that are in MCM;
>> 3. All the characters in the ccc=33 (shadda) part of the list;
>> 4. All the rest of the characters (in NFD order).
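
[A minimal Python sketch of the four steps quoted above, for discussion only.
It is not HarfBuzz code. It assumes Python's standard unicodedata module for
NFD and combining classes, and it uses the MCM set exactly as Roozbeh listed
it, so U+0653 is not included. The names MCM, reorder_marks and reorder are
made up for this illustration.]

```python
import unicodedata

# MCM set as listed in the quoted message (U+0653 is intentionally absent).
MCM = frozenset({0x0654, 0x0655, 0x0658, 0x06DC,
                 0x06E3, 0x06E7, 0x06E8, 0x08F3})


def reorder_marks(marks):
    """Reorder one NFD-ordered run of combining marks (steps 1 to 4)."""
    taken = set()   # indices already moved to the front
    out = []
    # Steps 1 and 2: leading run of MCM marks in the ccc=220 block,
    # then the leading run of MCM marks in the ccc=230 block.
    for ccc in (220, 230):
        for i, m in enumerate(marks):
            if unicodedata.combining(m) != ccc:
                continue
            if ord(m) not in MCM:
                break           # the run must be consecutive from the start
            taken.add(i)
            out.append(m)
    # Step 3: every ccc=33 mark (shadda).
    for i, m in enumerate(marks):
        if unicodedata.combining(m) == 33:
            taken.add(i)
            out.append(m)
    # Step 4: everything else, kept in NFD order.
    out.extend(m for i, m in enumerate(marks) if i not in taken)
    return out


def reorder(text):
    """Normalize to NFD, then reorder the marks following each base."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(reorder_marks(run))
            run = []
            out.append(ch)
    out.extend(reorder_marks(run))
    return "".join(out)
```

[Calling reorder() on a base letter followed by the "src" sequence from the
test data in the message gives the "out" sequence, and <U+0622, U+0670>
becomes <U+0627, U+0670, U+0653>, which is the Alef Madda case described in
the message.]
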
>>
>> Very obscure test data, just to demonstrate the algorithm:
>>
>> src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
>> ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
>> MCM:                      Yes  Yes       Yes                      Yes
>>
>> out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
>> ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
>> MCM:  Yes  Yes                                               Yes       Yes
>
> I think the order of Hamza below is not right, I'd expect it to come at
> least before other below marks, regardless of whether there are other
> MCM marks in the sequence or not.
>
>> Note that the algorithm guarantees canonical equivalence of the output and
>> input, and also guarantees the same result for all canonically equivalent
>> strings.
>>
>> Also note that you cannot replace NFD with NFC in the algorithm, because of
>> Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
>> Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
>> Alef should always come before Madda above, the sequence <Fatha, Alef Maksura,
>> Madda> is quite common in the Koran). If not for the exception of Alef Madda
>> above, an NFC version of the algorithm would work fine and in the same way.
>
> I disagree here, 0653 is actually a special form of Hamza and should be
> treated as other MCM marks. The madda used in Quran serves a quite
> different purpose and had its own code point; U+06E4 ARABIC SMALL HIGH
> MADDA.
>
>> Roozbeh
>> ===========================
>>
>> We think it's reasonable and will eventually implement something based on it.
>> Please discuss.
>>
>> behdad
>>
>>
>> On 12-12-18 10:59 AM, Khaled Hosny wrote:
>>> On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
>>>> On 12-12-18 12:13 AM, Khaled Hosny wrote:
>>>>> As for madda, Jonathan is right; it should indeed follow other marks, I
>>>>> don’t know what I was thinking.
>>>>>
>>>>> Some testing with people working on texts with heavy use of marks,
>>>>> showed that U+065C and U+06EC should precede vowel marks (but still
>>>>> follow the hamza).
>>>>
>>>> Thanks Khaled,
>>>>
>>>> Do you mind compiling a total order for the Arabic marks so I can
>>>> (blindly) go ahead and implement?
>>>
>>> List below separated in groups ordered to the best of my knowledge,
>>> marks in each group should be ordered before following groups. The order
>>> inside each group is not important IMO but I kept them ordered by the
>>> existing combining classes.
>>>
>>> Regards,
>>> Khaled
>>>
>>> (the first field is the existing combining class)
>>>
>>> 220 U+0655 ◌ٕ ARABIC HAMZA BELOW
>>> 220 U+065F ◌ٟ ARABIC WAVY HAMZA BELOW
>>> 230 U+0654 ◌ٔ ARABIC HAMZA ABOVE
>>>
>>> 220 U+065C ◌ٜ ARABIC VOWEL SIGN DOT BELOW
>>> 230 U+06EC ◌۬ ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
>>>
>>> 033 U+0651 ◌ّ ARABIC SHADDA
>>> 230 U+06DF ◌۟ ARABIC SMALL HIGH ROUNDED ZERO
>>> 230 U+06E0 ◌۠ ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
>>>
>>> 027 U+064B ◌ً ARABIC FATHATAN
>>> 027 U+08F0 ◌ࣰ ARABIC OPEN FATHATAN
>>> 028 U+064C ◌ٌ ARABIC DAMMATAN
>>> 028 U+08F1 ◌ࣱ ARABIC OPEN DAMMATAN
>>> 029 U+064D ◌ٍ ARABIC KASRATAN
>>> 029 U+08F2 ◌ࣲ ARABIC OPEN KASRATAN
>>> 030 U+0618 ◌ؘ ARABIC SMALL FATHA
>>> 030 U+064E ◌َ ARABIC FATHA
>>> 031 U+0619 ◌ؙ ARABIC SMALL DAMMA
>>> 031 U+064F ◌ُ ARABIC DAMMA
>>> 032 U+061A ◌ؚ ARABIC SMALL KASRA
>>> 032 U+0650 ◌ِ ARABIC KASRA
>>> 034 U+0652 ◌ْ ARABIC SUKUN
>>> 230 U+06E1 ◌ۡ ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
>>> 230 U+0657 ◌ٗ ARABIC INVERTED DAMMA
>>> 230 U+0658 ◌٘ ARABIC MARK NOON GHUNNA
>>> 230 U+0659 ◌ٙ ARABIC ZWARAKAY
>>> 230 U+065A ◌ٚ ARABIC VOWEL SIGN SMALL V ABOVE
>>> 230 U+065B ◌ٛ ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
>>> 230 U+065D ◌ٝ ARABIC REVERSED DAMMA
>>> 230 U+065E ◌ٞ ARABIC FATHA WITH TWO DOTS
>>>
>>> 035 U+0670 ◌ٰ ARABIC LETTER SUPERSCRIPT ALEF
>>> 220 U+0656 ◌ٖ ARABIC SUBSCRIPT ALEF
>>> 220 U+06ED ◌ۭ ARABIC SMALL LOW MEEM
>>> 230 U+06E2 ◌ۢ ARABIC SMALL HIGH MEEM ISOLATED FORM
>>>
>>> 220 U+06EA ◌۪ ARABIC EMPTY CENTRE LOW STOP
>>> 230 U+06EB ◌۫ ARABIC EMPTY CENTRE HIGH STOP
>>>
>>> 220 U+06E3 ◌ۣ ARABIC SMALL LOW SEEN
>>> 230 U+06E7 ◌ۧ ARABIC SMALL HIGH YEH
>>> 230 U+06E8 ◌ۨ ARABIC SMALL HIGH NOON
>>>
>>> 230 U+0653 ◌ٓ ARABIC MADDAH ABOVE
>>> 230 U+06E4 ◌ۤ ARABIC SMALL HIGH MADDA
>>>
>>> 230 U+0610 ◌ؐ ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
>>> 230 U+0611 ◌ؑ ARABIC SIGN ALAYHE ASSALLAM
>>> 230 U+0612 ◌ؒ ARABIC SIGN RAHMATULLAH ALAYHE
>>> 230 U+0613 ◌ؓ ARABIC SIGN RADI ALLAHOU ANHU
>>> 230 U+0614 ◌ؔ ARABIC SIGN TAKHALLUS
>>>
>>> 230 U+0615 ◌ؕ ARABIC SMALL HIGH TAH
>>> 230 U+0616 ◌ؖ ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
>>> 230 U+0617 ◌ؗ ARABIC SMALL HIGH ZAIN
>>> 230 U+06D6 ◌ۖ ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D7 ◌ۗ ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D8 ◌ۘ ARABIC SMALL HIGH MEEM INITIAL FORM
>>> 230 U+06D9 ◌ۙ ARABIC SMALL HIGH LAM ALEF
>>> 230 U+06DA ◌ۚ ARABIC SMALL HIGH JEEM
>>> 230 U+06DB ◌ۛ ARABIC SMALL HIGH THREE DOTS
>>> 230 U+06DC ◌ۜ ARABIC SMALL HIGH SEEN
>>>
>>
>> --
>> behdad
>> http://behdad.org/
>
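
[A rough Python sketch of how the grouped list quoted above could drive a
sort, for discussion only. Each group from Khaled's December 2012 list gets a
rank and the marks are stably sorted by rank, so the order inside a group is
left as it was. Placing marks that are not in the list after all groups is an
assumption made here, not something stated in the thread. The sketch makes no
attempt to preserve canonical equivalence, and the names GROUPS, RANK and
sort_marks_by_group are made up for this illustration.]

```python
# Groups in the order of the quoted list; earlier groups sort first.
GROUPS = [
    [0x0655, 0x065F, 0x0654],                          # hamza marks
    [0x065C, 0x06EC],
    [0x0651, 0x06DF, 0x06E0],                          # shadda group
    [0x064B, 0x08F0, 0x064C, 0x08F1, 0x064D, 0x08F2,   # tanween
     0x0618, 0x064E, 0x0619, 0x064F, 0x061A, 0x0650,   # short vowels
     0x0652, 0x06E1,
     0x0657, 0x0658, 0x0659, 0x065A, 0x065B, 0x065D, 0x065E],
    [0x0670, 0x0656, 0x06ED, 0x06E2],
    [0x06EA, 0x06EB],
    [0x06E3, 0x06E7, 0x06E8],
    [0x0653, 0x06E4],                                  # madda marks
    [0x0610, 0x0611, 0x0612, 0x0613, 0x0614],          # honorific signs
    [0x0615, 0x0616, 0x0617,
     0x06D6, 0x06D7, 0x06D8, 0x06D9, 0x06DA, 0x06DB, 0x06DC],
]

# Code point -> index of the group it belongs to.
RANK = {cp: i for i, group in enumerate(GROUPS) for cp in group}


def sort_marks_by_group(marks):
    """Stable sort of a run of marks by group rank.

    Marks missing from the list sort last (an assumption of this sketch).
    """
    return "".join(sorted(marks, key=lambda m: RANK.get(ord(m), len(GROUPS))))
```

[For example, the run <Fatha, Shadda, Hamza Above> sorts to <Hamza Above,
Shadda, Fatha>, and <Madda Above, Superscript Alef> sorts to <Superscript
Alef, Madda Above>.]
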
--
behdad
http://behdad.org/
_______________________________________________
HarfBuzz mailing list
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
