On 2:03:08 pm 11/13/06 Kevin Atkinson <[EMAIL PROTECTED]> wrote:
> On Mon, 13 Nov 2006, [EMAIL PROTECTED] wrote:
[...]
> > Linguistically, this consists of the consonant "ca" (U091A), and
> > the conjunct "kra", क्र (U0915 + U094D + U0930), and the UTF-8
> > storage would be U091A U0915 U094D U0930.
>
> So how many "letters"? Is that 3 or 4? Is U094D considered a
> "letter"?

That is 4 letters, including the initial consonant that is separate.
The conjunct itself is three.

> > Now, any calculations of edit distance, such as swap, etc., should
> > use the consonant "ca" and the conjunct "kra", not the individual
> > Unicode characters. If, for example, we operated on the individual
> > characters, a swap might move the "halant" (U094D) ahead of the
> > "ka" (U0915), making the character sequence U091A U094D U0915
> > U0930. As the "halant" is what is used to construct conjuncts,
> > this makes a new conjunct, "chka", च्क (U091A + U094D + U0915),
> > followed by the consonant "ra", र (U0930). This is not desirable,
> > as a confusion of spelling would never arise between "chka" and
> > "kra".
>
> So it is never the case you might want to substitute a letter in the
> conjunct with another letter? I assume you would. I would also
> assume that you would want to consider two conjuncts which are the
> same except for one letter as closer than two completely different
> conjuncts?

Yes, it is desirable to substitute a letter in the conjunct with
another letter, but the above example, where moving the halant changes
the structure of the word, is an unlikely mistake. I have to think
this through further, but maybe an edit distance mechanism that keeps
the position of the halant immutable might be the way to go.

> Also how likely is it that the user will swap two glyphs?

Not very likely as a typing error. However, it is quite likely that
one syllable might be substituted mentally for another while thinking
about what to write.

> Also if you ever want to implement any sort of true soundslike I
> would think you would want to work with letters not syllables.

I will need more advice from you on this, but I would have thought
that syllables are better to work with, especially as most Indian
languages are spelt phonetically.

> > Hope this makes more sense.
> > I will come up with a more detailed write-up including a
> > description of conjuncts, and why one should use syllables, rather
> > than characters, as the basic units for Indian language
> > spellchecking. Some of these issues, maybe most of them, can be
> > made up for by appropriate soundslike rules. I really should try
> > out some quantitative tests first.
>
> Possible but you really need a "looks like" rather than a
> "soundslike". I agree if you want to uniquely represent each
> syllable you may run out of symbols to use.
>
> However, it may be better to just use a syllable aware edit distance.

That is a very good suggestion, and I have to try it out.

> I now understand the issue. However, I think that the fact that
> Aspell is 8-bit internally is a very small factor. Converting Aspell
> to be 16-bit internally will not magically fix this issue. I don't
> even think it will make it significantly easier to solve.

Yes, the 8-bit size is not so much the issue. It is more that if the
internal representation were Unicode, it would be easier to use
existing libraries to parse syllables. However, a workaround is
probably not too difficult.

> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell. I suggest you start studying
> readonly_ws.cpp and suggest.cpp. A while ago I wrote some docs on
> how Aspell works:
> http://lists.gnu.org/archive/html/aspell-devel/2005-09/msg00007.html
> http://lists.gnu.org/archive/html/aspell-devel/2005-10/msg00000.html
> which may be helpful.

Thanks. These look useful.

> I will get back to you later with some ideas on how to approach this
> issue. If you already thought of some please share them.

I am realising that linguistically I am probably out of my depth with
Hindi. However, we are meeting this Sat., along with some literary
Hindi folk, and I am talking to experts in other Indian languages, to
plan out an approach. I will certainly make these available, probably
on a Wiki page.

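To make the syllable-aware edit distance idea a little more concrete,
here is a rough Python sketch. It is not Aspell code and makes several
assumptions: the function names (syllables, unit_cost,
syllable_distance) are made up for illustration, the segmentation rule
is a simplification (a combining mark, or any character following a
halant, joins the previous unit; ZWJ/ZWNJ are ignored), and the cost
weights are arbitrary choices that would need the quantitative tests
mentioned above.

```python
import unicodedata

HALANT = "\u094d"  # Devanagari virama, used to build conjuncts


def syllables(word):
    """Split a word into rough syllable units (simplified rule, an assumption)."""
    units = []
    for ch in word:
        is_mark = unicodedata.category(ch).startswith("M")
        if units and (is_mark or units[-1].endswith(HALANT)):
            units[-1] += ch  # mark, or consonant joined by a halant
        else:
            units.append(ch)
    return units


def levenshtein(a, b):
    """Plain code-point edit distance, used only inside a single unit."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def unit_cost(u, v):
    """Substitution cost in [0, 1]; similar conjuncts are cheaper to swap."""
    if u == v:
        return 0.0
    return levenshtein(u, v) / max(len(u), len(v))


def syllable_distance(w1, w2):
    """Edit distance over syllable units, with adjacent-unit transposition."""
    a, b = syllables(w1), syllables(w2)
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = float(i)
    for j in range(len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                                   # delete a unit
                d[i][j - 1] + 1,                                   # insert a unit
                d[i - 1][j - 1] + unit_cost(a[i - 1], b[j - 1]),   # substitute
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)        # swap two units
    return d[-1][-1]


# The example from this thread: "ca" followed by the conjunct "kra".
print(syllables("\u091a\u0915\u094d\u0930"))
```

With this rule the sequence U091A U0915 U094D U0930 splits into two
units, "ca" and "kra", and the halant can never be swapped away from
its consonant because swaps only apply to whole units. The halant-moved
sequence U091A U094D U0915 U0930, which is a single adjacent swap at
the code-point level, comes out as more than one unit edit here. Two
conjuncts that differ in one letter get a smaller substitution cost
than two unrelated ones, which is the behaviour asked about above.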
Thanks for all the interest that you have shown in this.

Regards,
Gora

_______________________________________________
Aspell-devel mailing list
Aspell-devel@gnu.org
http://lists.gnu.org/mailman/listinfo/aspell-devel