On Friday 25 June 2004 20:55, Mete Kural wrote:
>
> After reading your response I can now clearly understand the cause of the
> communication gap between us. Your proposal does not take into account the
> concept of graphemes vs. allographs.

No, I think you still have some confusion about the difference between Small Alef and Superscript Alef. I understand that this is a confusing topic, so I think it would be best to consult some Qur'an scholars.

> For that reason while we are proposing that a single code point for
> superscript/dagger/small alef is appropriate for all instances of
> superscript/dagger/small alef because there is really only one
> superscript/dagger/small alef "grapheme",

No. Superscript Alef is a vowel sign and is encoded as a nonspacing mark (NSM) in Unicode. It is therefore suitable ONLY for attaching to base characters, not for use as a standalone character, and it is very different from the Arabic alphabetical letter Alef. Small Alef is essentially an Alef: it cannot be considered a vowel sign, cannot be encoded as an NSM in Unicode, cannot be attached to any base character, and MUST be used as a standalone character to which other NSMs can be attached. The difference is obvious: these are TWO very different graphemes, one a vowel sign and the other an Arabic letter. How can one ever consider them the same grapheme?

> you are proposing two codes for superscript/dagger/small alef because
> there are two superscript/dagger/small alef "allographs". This is more of
> a philosophical problem between us in regards to encoding theory which is
> not easily solvable within one mailing list thread.

It's not a philosophical problem (and cannot be) between any two persons. I'm not telling you my opinion about something; I'm telling you facts and rules of the Arabic language. This is really not something to argue about, and if you must, then it has to be discussed and approved by the organizations responsible for standardizing the Arabic language. It's very simple: Superscript Alef is a vowel sign and Small Alef is an Arabic letter.
I know it's confusing because of the name of the vowel sign, but that is why the Unicode Standard carries the sentence "a vowel sign, despite the name" right below the character name. Whoever added it was wise enough to clear up this confusion.

> You also have some justification for your proposal by saying that what
> you are proposing is consistent with the rest of the Unicode Arabic block.
> I agree with you here, yes it may be consistent with the rest of the
> Unicode Arabic block, but the Unicode Arabic block is not based on a
> purely graphemic encoding scheme either. "The fact that the code is bad
> is no excuse to make it worse."
>

Even if it were bad, adding a new character won't make it worse; it will make it consistent.
"A complete Arabic block that is bad is MUCH BETTER than an incomplete Arabic block that is bad."
And even if it did make it worse, it would still not be worse than accepting:
ARABIC LETTER REH WITH HAMZA ABOVE
http://www.unicode.org/alloc/Pipeline.html (accepted in 2004-Feb-04)

> > As you can notice there is no SMALL HIGH WAW because the damma looks
> > exactly like a SMALL HIGH WAW, so there is no need for another character
> > for SMALL HIGH WAW and instead damma is used.
> > They share the same look property and even pronunciation but their name
> > is different because one is used as a vowel, and the other to denote a
> > missing WAW.
>
> The above also points to a misunderstanding of graphemes vs. allographs.

I think you have some confusion regarding graphemes and allographs.
"The fact that two graphemes have similar-looking glyphs doesn't mean that they are allographs."
As proof, I ask you: why do you think they are allographs? (I think the answer would be "because their glyphs are similar", and that proves my point.)
You should understand that a vowel sign and an Arabic letter cannot in any way be considered allographs.
For example:
+ A Jeem and a kasra cannot be considered allographs; they are separate graphemes.
+ A Waw and a damma cannot be considered allographs; they are separate graphemes.
+ An Alef and a "Superscript Alef - a vowel sign despite its name" cannot be considered allographs; they are separate graphemes.
+ A "Small Alef - used to replace a missing Alef" and a "Superscript Alef - a vowel sign despite its name" cannot be considered allographs; they are separate graphemes.

> If you are going to use two separate codepoints to encode
> superscript/dagger/small alef, one for its usage on top of alef maksura
> and another for its usage in words like haadha, dhaalika and bura'aa'u,

The one on top of the Alef Maksura is a vowel sign. The other one is an Arabic letter. (Notice that you couldn't use "on top of" here; that is because Arabic letters CANNOT be placed on top of other Arabic letters.)
Any attempt to encode them using the same codepoint is not only illogical but also amounts to a misspelling.

> I would tell you that you should use the same codepoint in haadha,
> dhaalika and bura'aa'u.

I can't recognize the words "haadha" or "dhaalika"; please use Arabic letters, or at least transliterate the words in a meaningful manner, so that I can understand what you mean.
I don't want to leave the strong point I'm raising, but here is another one for the record.
How can you use the same codepoint for the two words:
ØÙÙÙ
and
ØØØØØØ
(with the first Alef in the second word meaning the Small Alef)?
Superscript Alef is a vowel sign and is defined in Unicode as an NSM that is attached to base characters. Thus, if you used the same, already existing codepoint, it fails horribly:
ØØØÙØØ
To compare them:
ØØØØØØ
ØØØÙØØ
See, they are different in spelling: one has an Alef and the other has a vowel sign on top of the hamza, which is completely wrong.

> Do not use a different codepoint for bura'aa'u just because it appears
> lower than the one used in haadha and dhaalika.
> The difference between dhaalika and bura'aa'u is only at the allograph
> level, it is really the same grapheme. Otherwise you make an existing
> problem even worse.
>

Ah, you mean ØØÙÙ and ØØØØØØ
You are confused here: both of them are Small Alef, not the vowel sign, and there is no difference between them at all, even at the allograph level. They are at the same Y-axis position; neither of them is lower than the other.
But I think you meant words like ØÙÙÙ and ØØØØØØ

First, the expected rendering behavior:
One of them doesn't have to be lower than the other (in fact, in some masahef they are placed at the same Y-axis position); the position depends on the height of the base character the superscript alef sits on and on the various symbols that may already be on top of that base character.
But one of them MUST be on top of the previous character, and the other MUST NOT be on top of the previous character and MUST have its own spacing.

Second, the expected meaning:
One of them is a vowel sign and the other is the Arabic letter Alef.

I think this is more than sufficient justification for encoding them as different characters, but I will give you an example where it is clear that using the same code point for them is crude.
Assume that I'm developing a Qur'an application and want to implement a good search algorithm for it. There is a problem: the user is expected to type the word without the various Qur'anic symbols and even without harakat and vowel signs.
There are two solutions:
1. Add a separate text for searching which is encoded without Qur'anic symbols and vowel signs. (This is a bad solution.)
2. Use the concept of Normalization, where you do various tasks including:
   a. strip vowel signs (this includes the removal of all "superscript alefs")
   b.
      add missing letters by replacing the small letters with regular ones (this includes replacing all "Small Alefs" with Alef)

Let's assume we go with solution (2), which is the natural one.
Let's assume we are applying the algorithm to the two words (both encoded using the same codepoint, as you suggest):
ØÙÙÙ - ØØØÙØØ

Applying a, b in the order "a then b":
Applying a, the words become
ØÙÙ - ØØØØØ
Applying b, the words become
ØÙÙ - ØØØØØ
(Notice that applying 'b' did nothing.)
The resulting normalized words are
ØÙÙ - ØØØØØ
But the correct normalized words are
ØÙÙ - ØØØØØ
The word ØØØØØ is misspelled here, as you can see, and thus a search for the word ØØØØØØ fails miserably although it exists in the Qur'an.

Now let's apply a, b in the order "b then a":
Applying b, the words become
ØÙÙØ - ØØØØØØ
Applying a, the words become
ØÙÙØ - ØØØØØØ
(Notice that applying 'a' did nothing.)
The resulting normalized words are
ØÙÙØ - ØØØØØØ
But the correct normalized words are
ØÙÙ - ØØØØØØ
The word ØÙÙ is misspelled here, as you can see, and thus a search for the word ØÙÙ using "Whole Words Matching" fails miserably although it exists in the Qur'an as a whole word.

I think this should make it very clear why they should be encoded as different characters, even if you think they are only allographs. (They may, by the way, have the same glyph in some masahef, but they are still different characters and different graphemes.)

> So in conclusion I would like to tell you that we will not endorse a
> proposal to add a new additional superscript/dagger/small alef codepoint
> in a joint proposal. If you wish to propose this please do it in a
> separate individual proposal.
>
> Kind Regards,
> Mete
>

Mete, I don't know why I'm getting the feeling that you are against it, period.
I think you agree that we want the best idea to make its way into the proposal; an expression like "we will not endorse a proposal to..."
is really not appropriate. All of this needs to be discussed so that the best idea wins, rather than sticking to one idea and insisting on it without considering other factors.
Please at least read the point about normalization of the Qur'an text in this post carefully and comment on it.

-- 
Mohammed Yousif
Egypt
_______________________________________________
General mailing list
[EMAIL PROTECTED]
http://lists.arabeyes.org/mailman/listinfo/general
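
The normalization argument in this post (step a: strip vowel signs; step b: replace small letters with regular ones) can be sketched in Python. This is only a rough illustration under stated assumptions: U+0670 is the existing Superscript Alef vowel sign, the private-use code point U+E000 is a purely hypothetical stand-in for the proposed Small Alef letter, the function names are invented for this sketch, and the two sample strings are illustrative rather than the exact words discussed above.

```python
import unicodedata

ALEF = "\u0627"              # ARABIC LETTER ALEF
SUPERSCRIPT_ALEF = "\u0670"  # existing vowel sign, general category Mn
SMALL_ALEF = "\ue000"        # HYPOTHETICAL: private-use stand-in for the proposed letter


def strip_vowel_signs(text: str) -> str:
    """Step a: drop every nonspacing mark (harakat, superscript alef, ...)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def restore_small_letters(text: str, small_alef: str = SMALL_ALEF) -> str:
    """Step b: replace the small letter with the regular Alef."""
    return text.replace(small_alef, ALEF)


def normalize(text: str) -> str:
    """Steps a and b with two distinct code points; the order does not matter."""
    return restore_small_letters(strip_vowel_signs(text))


# Illustrative strings (assumptions, not taken from the garbled originals):
# ain, lam, alef maksura carrying the vowel sign
vowel_word = "\u0639\u0644\u0649" + SUPERSCRIPT_ALEF
# heh, small alef (standing for a missing Alef), thal, alef
letter_word = "\u0647" + SMALL_ALEF + "\u0630\u0627"

# Two code points: both words normalize to their plain spelling.
plain_vowel_word = normalize(vowel_word)
plain_letter_word = normalize(letter_word)

# One shared code point: the letter is encoded as U+0670 as well.
merged_letter_word = letter_word.replace(SMALL_ALEF, SUPERSCRIPT_ALEF)

# "a then b": the missing Alef is stripped together with the vowel signs.
a_then_b = restore_small_letters(
    strip_vowel_signs(merged_letter_word), SUPERSCRIPT_ALEF
)

# "b then a": the vowel sign is wrongly turned into a full Alef.
b_then_a = strip_vowel_signs(
    restore_small_letters(vowel_word, SUPERSCRIPT_ALEF)
)
```

With two code points both words come out correctly in either order. With one shared code point, "a then b" loses the Alef in the second word and "b then a" adds a spurious Alef to the first, which is the search failure described in the post.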

