Sorry for the delay. I've tried generating a sample wordlist and it was fine. I don't really know why we assumed that aspell wouldn't work with Arabic; M. Elzubeir started the Duali project, and I forked Duali and coded Baghdad.

Well, I have a working spell checker implementation that uses the Duali data set, which is originally the Buckwalter data set. I can say that the set is not really accurate: it was identifying some misspelled words as correct, and it was failing to identify some correct words. While I can accept it failing to identify every correct word, I can't accept it saying that misspelled words are correct.

We have a lot of old words in Arabic that are not really used, and as a native Arabic speaker I don't think it's a good idea to list them in our wordlist. If you use such words then you don't need a spell checker, because your language background is definitely solid enough :-)

I don't like the Buckwalter data set because it contains some incorrect words (of course it might be a problem in my implementation, but it might be a problem with the data set itself) and because no one has really gone through it and removed the old words.

My idea was to generate a reasonably authentic data set, but I don't have enough *modern* Arabic text, and even if I did, who is going to check it for errors? I'm a coder, not a linguist, and although I'm a native Arabic speaker, my language is not really that good and I don't have much time. All the people out there complaining about the lack of an Arabic spell checker didn't help with that part, and I can say that I'm stuck.

I'm willing to maintain the list of course, but I'm really unable to generate the initial one.

I can't tell you not to use the Buckwalter data set, as I don't have a replacement for you even if I don't like it, and I know that I should either do something or STFU.

Best regards,

On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
>
> Hi, Mohammed et al.  Gokalp Yapici and I are also working on getting
> Arabic for Aspell.  I thought we could share our plans to see if anybody
> wants to offer us helpful feedback.
>
> For character-set data, we started with the Farsi implementation in
> Aspell, which uses utf-8 as the word-list encoding and Windows Arabic as
> the internal encoding.
>
> For a word list, our plan is to use the data from Buckwalter's Arabic
> morphological analyzer -- the same data used in the Duali attempt at
> Arabic spell checking.  This data has a complex specification of the
> structure of an Arabic word, which we'll need to translate into the
> simpler format required by Aspell.
>
> In Buckwalter's format, each stem, prefix, or suffix is a member of a
> stem, prefix, or suffix class.  Three auxiliary files specify which
> prefix classes can connect to which stem classes; which stem classes can
> connect to which suffix classes; and which prefix classes are compatible
> with which suffix classes.
>
> If it weren't for that last file, this would be an easy problem: it would
> just be a matter of translating code names.  Instead, we'll write Perl
> scripts to recognize the easy translations (when no prefix/suffix
> combination is allowed, or all combinations are allowed), and do the easy
> thing.  For the harder combinations (where some of the prefixes go to
> some of the suffixes) we'll expand out the prefixes or the suffixes
> (whichever there are fewer of), combining them with the stems as new
> "stem" entries.
>
> There are a total of 170 affix (suffix and prefix) classes to start with.
> We'll probably more than run out of Aspell class codes (they're limited
> to 255) with the new classes we're creating.  If that's very severe, I'll
> see if we can't get Aspell updated to allow more suffix classes.
> Otherwise, we'll just explicitly expand out the combinations which lead
> to the fewest new entries in the stem list.
>
> What are some of the issues we haven't thought of?  Any feedback is
> welcome!

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature
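
For reference, the expansion strategy described in the quoted message (handle the easy cases directly, and for the harder ones fold the smaller side into the stems) might be sketched roughly as below. This is only an illustration, written in Python rather than the Perl scripts mentioned. The helper names `classify` and `plan_stem`, the class names, and the data layout are all hypothetical; nothing here reflects the real Buckwalter files or Aspell's affix format.

```python
from itertools import product


def classify(prefix_classes, suffix_classes, compatible):
    """Return 'none', 'all', or 'partial' for one stem class.

    `compatible` is a set of (prefix_class, suffix_class) pairs
    (a hypothetical stand-in for the prefix/suffix compatibility file).
    """
    pairs = set(product(prefix_classes, suffix_classes))
    allowed = pairs & compatible
    if not allowed:
        return "none"
    if allowed == pairs:
        return "all"
    return "partial"


def plan_stem(stem, prefixes, suffixes, compatible):
    """Plan word-list entries for one stem.

    `prefixes` / `suffixes` map a class name to its affix strings and are
    assumed to be already restricted to what this stem's class accepts.
    Returns (kind, entries); each entry is
    (word, prefix_class_flags, suffix_class_flags).
    """
    kind = classify(prefixes, suffixes, compatible)
    if kind != "partial":
        # Easy case: keep the stem and attach every class name as a flag.
        # Whether prefixes and suffixes may combine is carried by `kind`.
        return kind, [(stem, sorted(prefixes), sorted(suffixes))]

    n_prefix = sum(len(strings) for strings in prefixes.values())
    n_suffix = sum(len(strings) for strings in suffixes.values())
    entries = []
    if n_prefix <= n_suffix:
        # Fewer prefixes: glue each one onto the stem as a new "stem"
        # and keep only the suffix classes compatible with it.
        for p_class, strings in sorted(prefixes.items()):
            ok = sorted(s for s in suffixes if (p_class, s) in compatible)
            for p in strings:
                entries.append((p + stem, [], ok))
    else:
        # Fewer suffixes: same idea, mirrored.
        for s_class, strings in sorted(suffixes.items()):
            ok = sorted(p for p in prefixes if (p, s_class) in compatible)
            for s in strings:
                entries.append((stem + s, ok, []))
    return kind, entries


if __name__ == "__main__":
    # Made-up transliterated example; the empty string stands for "no affix".
    prefixes = {"P_0": [""], "P_wa": ["wa"]}
    suffixes = {"S_0": [""], "S_at": ["at"], "S_uwA": ["uwA"]}
    compatible = {("P_0", "S_0"), ("P_0", "S_at"), ("P_wa", "S_0")}
    print(plan_stem("ktb", prefixes, suffixes, compatible))
```

The sketch does not count how many class codes it uses, so the 255-code limit mentioned in the quoted message would still need separate handling.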
_______________________________________________ Aspell-user mailing list [email protected] http://lists.gnu.org/mailman/listinfo/aspell-user
