[Aspell-user] Re: Feedback on our approach to Arabic

Mohammed Sameer Sun, 12 Mar 2006 03:11:34 -0800

On Sat, Mar 11, 2006 at 07:32:17PM -0800, Ethan Bradford wrote:
> 
>    I don't see having archaic words as a particular problem.  It only reduces
>    quality when a user misspells into one.  Besides, some people might use
>    them, and even if they know how to spell them, we don't want to bother them
>    with spelling suggestions!


Yes, some people might use them, but if you do use them, I guess you don't
really need a spell checker.

I won't serve 10% of the people at the expense of the other 90%!

It doesn't make sense to me.

>    I don't see that we have a lot of options besides the Buckwalter data at
>    this moment.  I think Arabic is too inflected to build a spell-checker from
>    a straight word list.

True, I wouldn't consider a word list adequate unless it contains at least
1,000,000 words. That's _my_ personal estimate. I don't like the Buckwalter
approach, as I think that it contains incorrect words. That's bad for a spell
checker, and since I know how the Arabic community works, I can tell you that
no one will ever review it, and I guess (not sure) that no one will ever
report an incorrect word.

If I were the one doing it, I wouldn't use the Buckwalter data when I know
that it contains errors. It's not the kind of thing I'd like to maintain. But
as I said before, I'll STFU since I have nothing to offer.

One of my friends is trying to get me modern Arabic files and spell check them
using, err, M$ Word (no flames please). It's just taking him a lot of time.

Best regards,

>    Speaking of testing, does anybody on this list have good advice on testing
>    a new dictionary?  Just the obvious?

Or perhaps on how to start creating a dictionary?
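
On the testing side, one obvious check is to measure the two failure modes I
complained about earlier: misspelled words accepted as correct, and correct
words rejected. What follows is only a rough sketch. The `check` callable, the
`evaluate` function and the toy word lists are hypothetical, and the words are
Latin-letter stand-ins rather than real Arabic data. In practice `check` would
wrap whatever spell checker is under test.

```python
def evaluate(check, correct_words, misspelled_words):
    """Measure false rejects and false accepts for a spell checker.

    check            -- hypothetical callable: word -> True if accepted
    correct_words    -- words a human has verified as correctly spelled
    misspelled_words -- words a human has verified as misspelled
    """
    false_rejects = [w for w in correct_words if not check(w)]
    false_accepts = [w for w in misspelled_words if check(w)]
    return {
        "false_rejects": false_rejects,
        "false_accepts": false_accepts,
        # Guard against empty test sets to avoid division by zero.
        "false_reject_rate": (len(false_rejects) / len(correct_words)
                              if correct_words else 0.0),
        "false_accept_rate": (len(false_accepts) / len(misspelled_words)
                              if misspelled_words else 0.0),
    }


# Toy stand-in dictionary; "kitba" plays the role of an erroneous entry.
toy_dictionary = {"kitab", "qalam", "kitba"}
report = evaluate(lambda w: w in toy_dictionary,
                  correct_words=["kitab", "qalam", "bayt"],
                  misspelled_words=["kitba", "qalma"])
print(report["false_accepts"], report["false_rejects"])
```

For a spell checker the false-accept list is the one I would look at first,
since a wrong word in the list silently hides a real mistake.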

> 
>    On 3/10/06, Mohammed Sameer <[EMAIL PROTECTED]> wrote:
> 
>      Sorry for the delay,
>      I've tried generating a sample wordlist and it was fine. I don't really
>      know why we assumed that aspell wouldn't work with Arabic, and M.
>      Elzubeir started the Duali project, and I forked Duali and coded
>      Baghdad.
>      Well, I have a working spell checker implementation that uses the Duali
>      data set, which is originally the Buckwalter data set.
>      I can say that the set is not really accurate. It was identifying some
>      misspelled words as correct, and it was failing to identify correct
>      words. While I can accept it not identifying all the correct words, I
>      can't accept it saying that some misspelled words are correct.
>      We have a lot of old words in Arabic that are not really used, and being
>      a native Arabic speaker, I don't think that it's a good idea to list
>      them in our wordlist.
>      If you use such words then you don't need a spell checker, because your
>      language background is definitely solid enough :-)
>      I don't like the Buckwalter data set because it contains some incorrect
>      words (of course it might be a problem in my implementation, but it
>      might be a problem with the data set itself) and because no one has
>      really had a look at it and removed old words.
>      My idea was to generate a somewhat authentic data set, but I don't have
>      enough *modern* Arabic text, and even if I did, who is going to check it
>      for errors? I'm a coder, not a linguist, and although I'm a native
>      Arabic speaker, my language is not really that good and I don't really
>      have much time. All the people out there complaining about an Arabic
>      spell checker didn't help in that part, and I can say that I'm stuck.
>      I'm willing to maintain the list of course, but I'm really unable to
>      generate the initial one.
>      I can't tell you not to use the Buckwalter data set, as I don't have a
>      replacement for you even if I don't like it, and I know that I should
>      either do something or STFU.
>      Best regards,
>      On Tue, Mar 07, 2006 at 10:46:26PM -0800, Ethan Bradford wrote:
>      >
>      >    Hi, Mohammed et al.  Gokalp Yapici and I are also working on getting
>      >    Arabic for Aspell.  I thought we could share our plans to see if
>      >    anybody wants to offer us helpful feedback.
>      >    For character-set data, we started with the Farsi implementation in
>      >    Aspell, which uses utf-8 as the word-list encoding and Windows
>      >    Arabic as the internal encoding.
>      >    For a word list, our plan is to use the data from Buckwalter's
>      >    Arabic morphological analyzer -- the same data used in the Duali
>      >    attempt at Arabic spell checking.  This data has a complex
>      >    specification of the structure of an Arabic word, which we'll need
>      >    to translate into the simpler format required by Aspell.
>      >    In Buckwalter's format, each stem, prefix, or suffix is a member of
>      >    a stem, prefix, or suffix class.  Three auxiliary files specify
>      >    which prefix classes can connect to which stem classes; which stem
>      >    classes can connect to which suffix classes; and which prefix
>      >    classes are compatible with which suffix classes.
>      >    If it weren't for that last file, this would be an easy problem: it
>      >    would just be a matter of translating code names.  Instead, we'll
>      >    write perl scripts to recognize the easy translations (when no
>      >    prefix/suffix combination is allowed, or all combinations are
>      >    allowed), and do the easy thing.  For the harder combinations (where
>      >    some of the prefixes go to some of the suffixes) we'll expand out
>      >    the prefixes or the suffixes (whichever there are fewer of),
>      >    combining them with the stems as new "stem" entries.
>      >    There are a total of 170 affix (suffix and prefix) classes to start
>      >    with.  We'll probably more than run out of Aspell class codes
>      >    (they're limited to 255) with the new classes we're creating.  If
>      >    that's very severe, I'll see if we can't get Aspell updated to allow
>      >    more suffix classes.  Otherwise, we'll just explicitly expand out
>      >    the combinations which lead to the fewest new entries in the stem
>      >    list.
>      >    What are some of the issues we haven't thought of?  Any feedback is
>      >    welcome!
>      --
>      GNU/Linux registered user #224950
>      Proud Egyptian GNU/Linux User Group <[2]www.eglug.org> Admin.
>      Life powered by Debian, Homepage: [3]www.foolab.org 
>      --
>      Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
>      Read [4]http://www.gnu.org/philosophy/no-word-attachments.html
>      Preferable attachments: .PDF, .HTML, .TXT
>      Thanx for adding this text to Your signature
> 
> References
> 
>    1. mailto:[EMAIL PROTECTED]
>    2. http://www.eglug.org/
>    3. http://www.foolab.org/
>    4. http://www.gnu.org/philosophy/no-word-attachments.html
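
For what it's worth, the expansion scheme Ethan describes in the quoted
message (keep affix classes as flags when every prefix/suffix pair is allowed,
otherwise fold the smaller side into the stem) could be sketched roughly like
this. It is a simplified model under loud assumptions, not Buckwalter's real
file format and not Aspell's affix format: each class holds a single affix
string, the empty prefix and suffix are ordinary classes ("P0", "S0"), only
the "all combinations allowed" case is treated as easy, and every name and
the Latin-letter toy data are made up for illustration.

```python
from itertools import product


def expand_stem(stem, stem_cls, prefixes, suffixes, ab, bc, ac):
    """Return (base, prefix_classes, suffix_classes) entries for one stem.

    prefixes/suffixes -- dict: class name -> affix string (assumed 1 per class)
    ab, bc, ac        -- sets of allowed (prefix, stem), (stem, suffix) and
                         (prefix, suffix) class pairs
    None in place of a class list means that side was folded into the base.
    """
    ps = sorted(p for p in prefixes if (p, stem_cls) in ab)
    ss = sorted(s for s in suffixes if (stem_cls, s) in bc)
    allowed = {(p, s) for p, s in product(ps, ss) if (p, s) in ac}

    # Easy case: every prefix combines with every suffix, keep both as flags.
    if len(allowed) == len(ps) * len(ss):
        return [(stem, ps, ss)]

    entries = []
    if len(ps) <= len(ss):
        # Fewer prefixes: fold each prefix into a new "stem" entry.
        for p in ps:
            kept = [s for s in ss if (p, s) in allowed]
            if kept:
                entries.append((prefixes[p] + stem, None, kept))
    else:
        # Fewer suffixes: fold each suffix into a new "stem" entry.
        for s in ss:
            kept = [p for p in ps if (p, s) in allowed]
            if kept:
                entries.append((stem + suffixes[s], kept, None))
    return entries


def surface_forms(entries, prefixes, suffixes):
    """Enumerate the full words that a list of entries stands for."""
    forms = set()
    for base, pcs, scs in entries:
        pre = [""] if pcs is None else [prefixes[p] for p in pcs]
        suf = [""] if scs is None else [suffixes[s] for s in scs]
        forms.update(a + base + b for a, b in product(pre, suf))
    return forms


# Toy data: three prefix classes, three suffix classes, one stem class "N".
PREFIXES = {"P0": "", "Pal": "al", "Pbi": "bi"}
SUFFIXES = {"S0": "", "Sat": "at", "Sun": "un"}
AB = {(p, "N") for p in PREFIXES}
BC = {("N", s) for s in SUFFIXES}
AC = {("P0", "S0"), ("P0", "Sat"), ("P0", "Sun"),
      ("Pal", "S0"), ("Pal", "Sat"), ("Pbi", "S0")}

entries = expand_stem("kitab", "N", PREFIXES, SUFFIXES, AB, BC, AC)
print(entries)
print(sorted(surface_forms(entries, PREFIXES, SUFFIXES)))
```

Comparing `surface_forms` of the expanded entries against a brute-force walk
over the three tables would be one way to check that the translation does not
add or lose words.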

-- 
GNU/Linux registered user #224950
Proud Egyptian GNU/Linux User Group <www.eglug.org> Admin.
Life powered by Debian, Homepage: www.foolab.org
--
Don't send me any attachment in Micro$oft (.DOC, .PPT) format please
Read http://www.gnu.org/philosophy/no-word-attachments.html
Preferable attachments: .PDF, .HTML, .TXT
Thanx for adding this text to Your signature


_______________________________________________
Aspell-user mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/aspell-user
