On Sat, Mar 11, 2006 at 07:32:17PM -0800, Ethan Bradford wrote:
> I don't see having archaic words as a particular problem.  It only reduces
> quality when a user misspells into one.  Besides, some people might use
> them, and even if they know how to spell them, we don't want to bother them
> with spelling suggestions!

Yes, some people might use them. But if you do use them, I guess you don't
really need a spell checker. I won't trade 90% of the people for the other
10%! It doesn't make sense to me.

> I don't see that we have a lot of options besides the Buckwalter data at
> this moment.  I think Arabic is too inflected to build a spell-checker from
> a straight word list.

True. I wouldn't consider a word list adequate unless it contained at least
1,000,000 words; that's _my_ personal estimate. I don't like the Buckwalter
approach because I think the data contains incorrect words, and that's bad
for a spell checker. Since I know how the Arabic community works, I can tell
you that no one will ever review it, and I guess (I'm not sure) that no one
will ever report an incorrect word.

If I were the one doing it, I wouldn't use the Buckwalter data knowing that
it contains errors. It's not the kind of thing I'd like to maintain. But as
I said before, I'll STFU since I have nothing to offer.

One of my friends is trying to get me modern Arabic files and spell check
them using, err, M$ Word (no flames please). It's just taking him a lot of
time.

Best regards,

> Speaking of testing, does anybody on this list have good advice on testing
> a new dictionary?  Just the obvious?

Or perhaps how to start creating a dictionary?

> On 3/10/06, Mohammed Sameer <[EMAIL PROTECTED]> wrote:
> Sorry for the delay.
> I've tried generating a sample wordlist and it was fine. I don't really
> know why we assumed that aspell won't work with Arabic, and M. Elzubeir
> started the Duali project, and I forked Duali and coded Baghdad.
> Well, I have a working spell checker implementation that uses the Duali
> data set, which is originally the Buckwalter data set.
> I can say that the set is not really accurate. It was identifying some
> misspelled words as correct and failing to identify correct words. I can
> accept it not identifying all the correct words.
> I can't accept it saying that some misspelled words are correct.
> We have a lot of old words in Arabic that are not really used, and being a
> native Arabic speaker, I don't think it's a good idea to list them in our
> wordlist. If you use such words then you don't need a spell checker,
> because your language background is definitely solid enough :-)
> I don't like the Buckwalter data set because it contains some incorrect
> words (of course it might be a problem in my implementation, but it might
> be a problem with the data set itself) and because no one has really had a
> look at it and removed old words.
> My idea was to generate a somewhat authentic data set, but I don't have
> enough *modern* Arabic text, and even if I did, who is going to check it
> for errors? I'm a coder, not a linguist, and although I'm a native Arabic
> speaker, my language is not really that good and I don't really have much
> time. All the people out there complaining about the lack of an Arabic
> spell checker didn't help in that part, and I can say that I'm stuck.
> I'm willing to maintain the list of course, but I'm really unable to
> generate the initial one.
> I can't tell you not to use the Buckwalter data set, as I don't have a
> replacement for you even if I don't like it, and I know that I should
> either do something or STFU.
> Best regards,
>
> On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
> > Hi, Mohammed et al.  Gokalp Yapici and I are also working on getting
> > Arabic for Aspell.  I thought we could share our plans to see if anybody
> > wants to offer us helpful feedback.
> >
> > For character-set data, we started with the Farsi implementation in
> > Aspell, which uses UTF-8 as the word-list encoding and Windows Arabic as
> > the internal encoding.
> >
> > For a word list, our plan is to use the data from Buckwalter's Arabic
> > morphological analyzer, the same data used in the Duali attempt at
> > Arabic spell checking.
> > This data has a complex specification of the structure of an Arabic
> > word, which we'll need to translate into the simpler format required by
> > Aspell.
> >
> > In Buckwalter's format, each stem, prefix, or suffix is a member of a
> > stem, prefix, or suffix class.  Three auxiliary files specify which
> > prefix classes can connect to which stem classes; which stem classes can
> > connect to which suffix classes; and which prefix classes are compatible
> > with which suffix classes.
> >
> > If it weren't for that last file, this would be an easy problem: it would
> > just be a matter of translating code names.  Instead, we'll write Perl
> > scripts to recognize the easy translations (when no prefix/suffix
> > combination is allowed, or all combinations are allowed), and do the easy
> > thing.  For the harder combinations (where some of the prefixes go to
> > some of the suffixes) we'll expand out the prefixes or the suffixes
> > (whichever there are fewer of), combining them with the stems as new
> > "stem" entries.
> >
> > There are a total of 170 affix (suffix and prefix) classes to start with.
> > We'll probably more than run out of Aspell class codes (they're limited
> > to 255) with the new classes we're creating.  If that's very severe, I'll
> > see if we can't get Aspell updated to allow more suffix classes.
> > Otherwise, we'll just explicitly expand out the combinations which lead
> > to the fewest new entries in the stem list.
> >
> > What are some of the issues we haven't thought of?  Any feedback is
> > welcome!
>
> --
> GNU/Linux registered user #224950
> Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
> Life powered by Debian, Homepage: www.foolab.org
> --
> Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
> Read http://www.gnu.org/philosophy/no-word-attachments.html
> Preferable attachments: .PDF, .HTML, .TXT
> Thanx for adding this text to Your signature

_______________________________________________
Aspell-user mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/aspell-user
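
The plan quoted in this thread (sort each stem class into the easy "all" or
"none" cases, and expand the hard "mixed" case into new stem entries) can be
sketched roughly as below. This is only a minimal illustration under stated
assumptions: the table layout, the class names (P0, PAl, S0, SPoss, N), and
the transliterated toy stem "kitab" are hypothetical and are not taken from
Buckwalter's files or from Aspell. Null affixes are modelled as classes
holding the empty string, and "fewer" is counted in surface forms.

```python
# Hypothetical toy data; none of these names come from Buckwalter or Aspell.
PREFIX_FORMS = {"P0": [""], "PAl": ["al"]}        # prefix class -> forms
SUFFIX_FORMS = {"S0": [""], "SPoss": ["hu"]}      # suffix class -> forms
PREFIX_STEM = {("P0", "N"), ("PAl", "N")}         # prefix class fits stem class
STEM_SUFFIX = {("N", "S0"), ("N", "SPoss")}       # stem class fits suffix class
# The "last file": which prefix classes may co-occur with which suffix classes.
PREFIX_SUFFIX = {("P0", "S0"), ("P0", "SPoss"), ("PAl", "S0")}


def classify(stem_class):
    """Return (kind, prefixes, suffixes, allowed) for one stem class.

    kind is "all" or "none" (the easy cases) or "mixed" (the hard case).
    """
    prefixes = sorted(p for p, s in PREFIX_STEM if s == stem_class)
    suffixes = sorted(x for s, x in STEM_SUFFIX if s == stem_class)
    allowed = {(p, x) for p in prefixes for x in suffixes
               if (p, x) in PREFIX_SUFFIX}
    if len(allowed) == len(prefixes) * len(suffixes):
        kind = "all"
    elif not allowed:
        kind = "none"
    else:
        kind = "mixed"
    return kind, prefixes, suffixes, allowed


def expand(stem, stem_class):
    """Fold the smaller affix side into the stem as new "stem" entries.

    Returns (new_stem, affix_classes_still_allowed) pairs.
    """
    _, prefixes, suffixes, allowed = classify(stem_class)
    n_pre = sum(len(PREFIX_FORMS[p]) for p in prefixes)
    n_suf = sum(len(SUFFIX_FORMS[x]) for x in suffixes)
    entries = []
    if n_pre <= n_suf:
        # Fewer prefix forms: glue each prefix onto the stem.
        for p in prefixes:
            keep = [x for x in suffixes if (p, x) in allowed]
            if keep:  # a prefix with no compatible suffix yields no word
                entries += [(form + stem, keep) for form in PREFIX_FORMS[p]]
    else:
        # Fewer suffix forms: glue each suffix onto the stem.
        for x in suffixes:
            keep = [p for p in prefixes if (p, x) in allowed]
            if keep:
                entries += [(stem + form, keep) for form in SUFFIX_FORMS[x]]
    return entries


def words_from_tables(stem, stem_class):
    """Brute-force reference: every word the three tables allow."""
    _, _, _, allowed = classify(stem_class)
    return {pf + stem + sf
            for p, x in allowed
            for pf in PREFIX_FORMS[p]
            for sf in SUFFIX_FORMS[x]}


print(classify("N")[0])
print(expand("kitab", "N"))
```

Each pair returned by expand() is a candidate stem-list entry together with
the affix classes it may still take. Comparing the words those entries
generate against words_from_tables() is one cheap way to check that an
expansion neither adds nor loses forms.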
