On 11:42:07 am 11/13/06 Kevin Atkinson <[EMAIL PROTECTED]> wrote:
[...]
> The explanation below did nothing to explain why you want to be able
> to store the "kra" conjunct, especially since the conjunct doesn't
> exist in Unicode.
[...]

Sorry, maybe I am assuming too much familiarity with Indian languages. A conjunct in Hindi, such as "kra", should be treated as a new entity, on par with the base consonants that exist in Unicode.

Let us take the word "chakra", चक्र, for example. I am not sure if you can see the proper rendering, but there should be two glyphs. Linguistically, this consists of the consonant "ca" (U+091A) and the conjunct "kra", क्र (U+0915 + U+094D + U+0930), so the stored sequence of code points would be U+091A U+0915 U+094D U+0930.

Now, any calculation of edit distance, such as a swap, should use the consonant "ca" and the conjunct "kra" as its units, not the individual Unicode characters. If, for example, we operated on the individual characters, a swap might move the "halant" (U+094D) ahead of the "ka" (U+0915), making the character sequence U+091A U+094D U+0915 U+0930. As the "halant" is what is used to construct conjuncts, this makes a new conjunct, "chka", च्क (U+091A + U+094D + U+0915), followed by the consonant "ra", र (U+0930). This is not desirable, as a spelling confusion would never arise between "chka" and "kra".

If instead one operated on conjuncts (actually, the operations need to be on syllables), a swap would produce the conjunct "kra" followed by the consonant "ca", with the stored sequence being U+0915 U+094D U+0930 U+091A.

Hope this makes more sense. I will come up with a more detailed write-up, including a description of conjuncts and why one should use syllables, rather than characters, as the basic units for Indian language spellchecking. Some of these issues, maybe most of them, can be compensated for by appropriate soundslike rules. I really should try out some quantitative tests first.

Regards,
Gora

P.S. I was also toying with the idea of writing an aspell UNO component to enable usage from OpenOffice. I see that there has been some discussion on this earlier.
Do you think that such a component would still be useful, or has the integration of Hunspell into OpenOffice made something like this unnecessary?
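
To make the "chakra" example concrete, here is a minimal sketch in Python of the difference between swapping code points and swapping larger units. The names split_units and swap_adjacent are hypothetical, and nothing here reflects how aspell works internally. The grouping rule is also a simplification: it attaches combining marks, and whatever follows a halant, to the preceding unit, which is enough for this word but is not a full syllable segmentation.

```python
import unicodedata

VIRAMA = "\u094D"  # the "halant", which joins consonants into a conjunct


def split_units(word):
    """Group code points into units: a base character plus any combining
    marks, with a halant also pulling in the character that follows it.
    Simplified illustration only, not a complete syllable splitter."""
    units = []
    after_virama = False
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if units and (is_mark or after_virama):
            units[-1] += ch      # extend the current unit
        else:
            units.append(ch)     # start a new unit
        after_virama = (ch == VIRAMA)
    return units


def swap_adjacent(items, i):
    """Return a copy of items with positions i and i + 1 exchanged."""
    swapped = list(items)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped


chakra = "\u091A\u0915\u094D\u0930"  # "ca" followed by the conjunct "kra"

# Swap on raw code points: the halant moves ahead of "ka",
# giving U+091A U+094D U+0915 U+0930 ("chka" followed by "ra").
by_code_point = "".join(swap_adjacent(chakra, 1))

# Swap on units: "kra" stays intact and simply changes places with "ca",
# giving U+0915 U+094D U+0930 U+091A.
by_unit = "".join(swap_adjacent(split_units(chakra), 0))

print(split_units(chakra))
print([f"U+{ord(c):04X}" for c in by_code_point])
print([f"U+{ord(c):04X}" for c in by_unit])
```

The point of the sketch is only that the choice of unit changes which neighbouring "misspellings" an edit-distance calculation will consider; a real implementation would need a proper definition of the syllable.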