+roozbeh

On 13-10-18 04:52 PM, Khaled Hosny wrote:
> On Thu, Oct 17, 2013 at 10:05:20PM +0200, Behdad Esfahbod wrote:
>> Khaled,
>>
>> Here's what Roozbeh prepared:
>>
>> ===========================
>> Behdad,
>>
>> I did a very thorough search of both the Koran and the Unicode proposals for
>> the new Arabic characters for the last fifteen years or so.
>>
>> I could actually come up with a very simple algorithm:
>>
>> First, convert the input sequence to NFD.
>>
>> The order of the characters will be a bit messed up after this due to bad old
>> decisions in Unicode, and our goal is to make it clean. After this step, we
>> will have the traditional marks (ccc not in [220, 230]) at the very beginning,
>> with the newly encoded ones (ccc in [220, 230]) after them.
>>
>> Definition: MCM, defined here, is the set of modifier combining marks, which
>> actually modify a base letter (and also have ccc=220 or 230). That means that
>> traditional harakat come after them in logical order, but before them in NFD.
>> Here is the MCM set:
>>
>> 0654 ARABIC HAMZA ABOVE
>> 0655 ARABIC HAMZA BELOW
>> 0658 ARABIC MARK NOON GHUNNA
>> 06DC ARABIC SMALL HIGH SEEN
>> 06E3 ARABIC SMALL LOW SEEN
>> 06E7 ARABIC SMALL HIGH YEH
>> 06E8 ARABIC SMALL HIGH NOON
>> 08F3 ARABIC SMALL HIGH WAW
> 
> U+0653 ARABIC MADDAH ABOVE should be added to this list, see below.
> 
>> The following is the order in which the combining marks after each base letter
>> should be read, for them to be in logical order (it could be used for both
>> determining rendering order, and backspacing):
>>
>> 1. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=220 part of the list that are in MCM;
>> 2. The longest "consecutive" sequence of characters "at the beginning" of the
>> ccc=230 part of the list that are in MCM;
>> 3. All the characters in the ccc=33 (shadda) part of the list;
>> 4. All the rest of the characters (in NFD order).
>>
>> Very obscure test data, just to demonstrate the algorithm:
>>
>> src: 0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650
>> ccc:   30   31   30   31  230  230  230  230   33  220   33  220  220   32
>> MCM:                      Yes  Yes       Yes                      Yes
>>
>> out: 0654 0658 0651 0651 0618 064E 0619 064F 0650 0656 065C 0655 0653 0654
>> ccc:  230  230   33   33   30   30   31   31   32  220  220  220  230  230
>> MCM:  Yes  Yes                                               Yes       Yes
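
For anyone who wants to try this out, here is a minimal Python sketch of the
four steps above. It is only an illustration, not the HarfBuzz implementation:
the names MCM, reorder_marks and reorder are made up here, it relies on the
standard unicodedata module for NFD and ccc values, and it assumes that every
character with ccc=0 is a base letter.

```python
import unicodedata

# MCM set as listed in the message above (U+0653 is not in it).
MCM = frozenset({0x0654, 0x0655, 0x0658, 0x06DC, 0x06E3, 0x06E7, 0x06E8, 0x08F3})


def _mcm_prefix(marks, ccc, mcm):
    """Indices of the leading run of MCM marks inside the given ccc part."""
    taken = []
    for i, ch in enumerate(marks):
        if unicodedata.combining(ch) != ccc:
            continue          # not in this ccc part of the list
        if ord(ch) not in mcm:
            break             # the consecutive MCM run ends here
        taken.append(i)
    return taken


def reorder_marks(marks, mcm=MCM):
    """Reorder one run of combining marks that is already in NFD order."""
    # Steps 1 and 2: leading MCM marks of the ccc=220 and ccc=230 parts.
    first = _mcm_prefix(marks, 220, mcm) + _mcm_prefix(marks, 230, mcm)
    # Step 3: everything with ccc=33 (shadda).
    shadda = [i for i, ch in enumerate(marks) if unicodedata.combining(ch) == 33]
    # Step 4: all remaining marks, kept in NFD order.
    used = set(first) | set(shadda)
    rest = [i for i in range(len(marks)) if i not in used]
    return "".join(marks[i] for i in first + shadda + rest)


def reorder(text, mcm=MCM):
    """Convert to NFD, then reorder the marks after each base character."""
    out, run = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.append(reorder_marks(run, mcm))
            run = []
            out.append(ch)
    out.append(reorder_marks(run, mcm))
    return "".join(out)


# The test data above, attached to an arbitrary base letter (U+0628 BEH).
src = "0618 0619 064E 064F 0654 0658 0653 0654 0651 0656 0651 065C 0655 0650"
text = "\u0628" + "".join(chr(int(h, 16)) for h in src.split())
print(" ".join(f"{ord(c):04X}" for c in reorder(text)[1:]))
```

Passing mcm=MCM | {0x0653} lets you try the suggestion of adding U+0653 to the
set; with the test data above, 0653 and the second 0654 then move to the front
together with the other ccc=230 MCM marks.
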
> 
> I think the order of Hamza Below is not right; I'd expect it to come at
> least before other below marks, regardless of whether there are other
> MCM marks in the sequence or not.
> 
>> Note that the algorithm guarantees canonical equivalence of the output and
>> input, and also guarantees the same result for all canonically equivalent
>> strings.
>>
>> Also note that you cannot replace NFD with NFC in the algorithm, because of
>> Alef Madda Above: 0622=0627 0653. The result of the algorithm for <Alef Madda,
>> Superscript Alef> should be <Alef, Superscript Alef, Madda Above> (Superscript
>> Alef should always come before Madda Above; the sequence <Fatha, Alef Maksura,
>> Madda> is quite common in the Koran). If not for the exception of Alef Madda
>> above, an NFC version of the algorithm would work fine and in the same way.
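
The NFC problem can be seen with a quick check in Python's standard unicodedata
module. This is just a sketch to illustrate the point, not part of the
proposal; the constant names are made up for readability.

```python
import unicodedata

ALEF, ALEF_MADDA = "\u0627", "\u0622"
SUPERSCRIPT_ALEF, MADDA_ABOVE = "\u0670", "\u0653"

# The order the algorithm should produce.
desired = ALEF + SUPERSCRIPT_ALEF + MADDA_ABOVE

# NFD splits 0622 and sorts by ccc (0670 has ccc=35, 0653 has ccc=230).
nfd = unicodedata.normalize("NFD", ALEF_MADDA + SUPERSCRIPT_ALEF)
print(" ".join(f"{ord(c):04X}" for c in nfd))   # 0627 0670 0653

# NFC recomposes 0627 + 0653 into 0622, so the Madda ends up first again.
nfc = unicodedata.normalize("NFC", desired)
print(" ".join(f"{ord(c):04X}" for c in nfc))   # 0622 0670
```
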
> 
> I disagree here: 0653 is actually a special form of Hamza and should be
> treated like the other MCM marks. The madda used in the Quran serves a quite
> different purpose and has its own code point, U+06E4 ARABIC SMALL HIGH
> MADDA.
> 
>> Roozbeh
>> ===========================
>>
>> We think it's reasonable and will eventually implement something based on it.
>>  Please discuss.
>>
>> behdad
>>
>>
>> On 12-12-18 10:59 AM, Khaled Hosny wrote:
>>> On Tue, Dec 18, 2012 at 12:15:45AM -0500, Behdad Esfahbod wrote:
>>>> On 12-12-18 12:13 AM, Khaled Hosny wrote:
>>>>> As for madda, Jonathan is right; it should indeed follow other marks, I
>>>>> don’t know what I was thinking.
>>>>>
>>>>> Some testing with people working on texts with heavy use of marks
>>>>> showed that U+065C and U+06EC should precede vowel marks (but still
>>>>> follow the hamza).
>>>>
>>>> Thanks Khaled,
>>>>
>>>> Do you mind compiling a total order for the Arabic marks so I can (blindly)
>>>> go ahead and implement?
>>>
>>> The list below is separated into groups, ordered to the best of my
>>> knowledge; marks in each group should be ordered before those in the
>>> following groups. The order inside each group is not important IMO, but I
>>> kept them ordered by the existing combining classes.
>>>
>>> Regards,
>>> Khaled
>>>
>>> (the first field is the existing combining class)
>>>
>>> 220 U+0655  ◌ٕ      ARABIC HAMZA BELOW
>>> 220 U+065F  ◌ٟ      ARABIC WAVY HAMZA BELOW
>>> 230 U+0654  ◌ٔ      ARABIC HAMZA ABOVE
>>>
>>> 220 U+065C  ◌ٜ      ARABIC VOWEL SIGN DOT BELOW
>>> 230 U+06EC  ◌۬      ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
>>>
>>> 033 U+0651  ◌ّ      ARABIC SHADDA
>>> 230 U+06DF  ◌۟      ARABIC SMALL HIGH ROUNDED ZERO
>>> 230 U+06E0  ◌۠      ARABIC SMALL HIGH UPRIGHT RECTANGULAR ZERO
>>>
>>> 027 U+064B  ◌ً      ARABIC FATHATAN
>>> 027 U+08F0  ◌ࣰ      ARABIC OPEN FATHATAN
>>> 028 U+064C  ◌ٌ      ARABIC DAMMATAN
>>> 028 U+08F1  ◌ࣱ      ARABIC OPEN DAMMATAN
>>> 029 U+064D  ◌ٍ      ARABIC KASRATAN
>>> 029 U+08F2  ◌ࣲ      ARABIC OPEN KASRATAN
>>> 030 U+0618  ◌ؘ      ARABIC SMALL FATHA
>>> 030 U+064E  ◌َ      ARABIC FATHA
>>> 031 U+0619  ◌ؙ      ARABIC SMALL DAMMA
>>> 031 U+064F  ◌ُ      ARABIC DAMMA
>>> 032 U+061A  ◌ؚ      ARABIC SMALL KASRA
>>> 032 U+0650  ◌ِ      ARABIC KASRA
>>> 034 U+0652  ◌ْ      ARABIC SUKUN
>>> 230 U+06E1  ◌ۡ      ARABIC SMALL HIGH DOTLESS HEAD OF KHAH
>>> 230 U+0657  ◌ٗ      ARABIC INVERTED DAMMA
>>> 230 U+0658  ◌٘      ARABIC MARK NOON GHUNNA
>>> 230 U+0659  ◌ٙ      ARABIC ZWARAKAY
>>> 230 U+065A  ◌ٚ      ARABIC VOWEL SIGN SMALL V ABOVE
>>> 230 U+065B  ◌ٛ      ARABIC VOWEL SIGN INVERTED SMALL V ABOVE
>>> 230 U+065D  ◌ٝ      ARABIC REVERSED DAMMA
>>> 230 U+065E  ◌ٞ      ARABIC FATHA WITH TWO DOTS
>>>
>>> 035 U+0670  ◌ٰ      ARABIC LETTER SUPERSCRIPT ALEF
>>> 220 U+0656  ◌ٖ      ARABIC SUBSCRIPT ALEF
>>> 220 U+06ED  ◌ۭ      ARABIC SMALL LOW MEEM
>>> 230 U+06E2  ◌ۢ      ARABIC SMALL HIGH MEEM ISOLATED FORM
>>>
>>> 220 U+06EA  ◌۪      ARABIC EMPTY CENTRE LOW STOP
>>> 230 U+06EB  ◌۫      ARABIC EMPTY CENTRE HIGH STOP
>>>
>>> 220 U+06E3  ◌ۣ      ARABIC SMALL LOW SEEN
>>> 230 U+06E7  ◌ۧ      ARABIC SMALL HIGH YEH
>>> 230 U+06E8  ◌ۨ      ARABIC SMALL HIGH NOON
>>>
>>> 230 U+0653  ◌ٓ      ARABIC MADDAH ABOVE
>>> 230 U+06E4  ◌ۤ      ARABIC SMALL HIGH MADDA
>>>
>>> 230 U+0610  ◌ؐ      ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
>>> 230 U+0611  ◌ؑ      ARABIC SIGN ALAYHE ASSALLAM
>>> 230 U+0612  ◌ؒ      ARABIC SIGN RAHMATULLAH ALAYHE
>>> 230 U+0613  ◌ؓ      ARABIC SIGN RADI ALLAHOU ANHU
>>> 230 U+0614  ◌ؔ      ARABIC SIGN TAKHALLUS
>>>
>>> 230 U+0615  ◌ؕ      ARABIC SMALL HIGH TAH
>>> 230 U+0616  ◌ؖ      ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH
>>> 230 U+0617  ◌ؗ      ARABIC SMALL HIGH ZAIN
>>> 230 U+06D6  ◌ۖ      ARABIC SMALL HIGH LIGATURE SAD WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D7  ◌ۗ      ARABIC SMALL HIGH LIGATURE QAF WITH LAM WITH ALEF MAKSURA
>>> 230 U+06D8  ◌ۘ      ARABIC SMALL HIGH MEEM INITIAL FORM
>>> 230 U+06D9  ◌ۙ      ARABIC SMALL HIGH LAM ALEF
>>> 230 U+06DA  ◌ۚ      ARABIC SMALL HIGH JEEM
>>> 230 U+06DB  ◌ۛ      ARABIC SMALL HIGH THREE DOTS
>>> 230 U+06DC  ◌ۜ      ARABIC SMALL HIGH SEEN
>>>
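
A total order like the one listed above could be applied with a stable sort
keyed on the group number. Here is a minimal Python sketch, assuming the groups
are exactly the blank-line-separated blocks of the list. GROUPS, RANK and
sort_marks are hypothetical names, and sorting unlisted marks after all listed
groups is an assumption made here; the list itself does not say where they go.

```python
# Groups copied from the list above, in order (code points only).
GROUPS = [
    [0x0655, 0x065F, 0x0654],
    [0x065C, 0x06EC],
    [0x0651, 0x06DF, 0x06E0],
    [0x064B, 0x08F0, 0x064C, 0x08F1, 0x064D, 0x08F2, 0x0618, 0x064E,
     0x0619, 0x064F, 0x061A, 0x0650, 0x0652, 0x06E1, 0x0657, 0x0658,
     0x0659, 0x065A, 0x065B, 0x065D, 0x065E],
    [0x0670, 0x0656, 0x06ED, 0x06E2],
    [0x06EA, 0x06EB],
    [0x06E3, 0x06E7, 0x06E8],
    [0x0653, 0x06E4],
    [0x0610, 0x0611, 0x0612, 0x0613, 0x0614],
    [0x0615, 0x0616, 0x0617, 0x06D6, 0x06D7, 0x06D8, 0x06D9, 0x06DA,
     0x06DB, 0x06DC],
]

# Map each code point to the index of its group.
RANK = {cp: i for i, group in enumerate(GROUPS) for cp in group}


def sort_marks(marks):
    """Stable-sort a run of marks by group; unlisted marks go last (assumption)."""
    return "".join(sorted(marks, key=lambda ch: RANK.get(ord(ch), len(GROUPS))))


# Madda, fatha, shadda, hamza above -> hamza above, shadda, fatha, madda.
print(" ".join(f"{ord(c):04X}" for c in sort_marks("\u0653\u064E\u0651\u0654")))
```

Because Python's sort is stable, marks within one group keep their input order.
Note that a sort like this can swap two marks of the same combining class (for
example 0653 and 0654, both ccc=230), so its output is not always canonically
equivalent to its input.
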
>>
>> -- 
>> behdad
>> http://behdad.org/
> 

-- 
behdad
http://behdad.org/
_______________________________________________
HarfBuzz mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/harfbuzz
