Hi Meor,

It is me, Dan, who released the Arabic wordlist. Nadav was explaining how one can build an Arabic spell-checker along the lines of what we did for Hebrew.

> I've downloaded the file, and have some questions about the format.
> The data is stored in ascii (i think ) which uses latin character to
> represent arabic character. Please forgive my ignorance, but is there
> a standard mapping for this? Any reference?

It was Tim Buckwalter who built the database (see http://www.qamus.org/). What I released was a mere conversion, and I kept his files (almost) untouched. He uses his own transliteration of Arabic to ASCII. You can see it at http://www.qamus.org/transliteration.gif .

> I do notice the there is a perl script that translate it to ISO format, but
> personally, I prefers UTF-8 encoding for the data. Most "modern" programs can
> handle UTF-8 .

I preferred working with an 8-bit encoding internally; it is more economical. If you want to use UTF-8, you can do

    to_iso6.pl | iconv -f iso8859-6 -t utf8

> One thing about the data the Mr Sameer mentioned, is the concern about
> Holy Quran words spelling. I think this one I can help a bit. I do

Indeed. Having words in antiquated spelling is bad. However, I don't see any way to solve this problem other than having someone proficient in Arabic go over the 82,157 stems in the list and correct them. It is possible!

> I don't know about aspell format, but I would like to add a few field
> into the generated data. Probably the final data will be stored in a
> database. The "must have" column for each words is it's root. If the
> data is stored in a database, then this column can be just the
> reference to the row which indicate it's root. So, when the users want
> to lookup those words, I can tell you straight away it's root. I think
> this is a big help.

I think that what you are describing is called a morphological analyzer, and such software exists (http://www.nongnu.org/aramorph/), based on the same Buckwalter database. I, on the other hand, am "thinking small", and trying to limit myself to building a useful spell-checker.

> Why database? Well, I think it's the easiest for lookup. As you
> mentioned before, the generated dataset is huge, and some application
> have some problem loading it to memory. If we store it in database, we
> don't have to worry about that. Application just ask the database for
> it. Also, it is easy to generate a new word list from the database to
> suite other applications' need (aspell, dict format etc). The
> advantage, as I mentioned before, we don't have to worry about storage
> and memory management, plus, we get the benefit of relational data,
> easier for cross referencing.

I disagree with you on that. The database already exists. It was written in plain text by Tim Buckwalter. I don't see how storing this database in SQL would help; it is not as if there are millions of people trying to help correct and extend the database :-(

> Or maybe what we really need is a seperate dictionary for the Quran?

Indeed, it would be useful to keep Quran words in a separate file, and not just replace them with the modern spelling. Someone has to do it. Hopefully, someone (maybe from this list) will.

--
Dan Kenigsberg http://www.cs.technion.ac.il/~danken ICQ 162180901
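
P.S. If iconv is not at hand, the ISO-8859-6 to UTF-8 step mentioned above can be sketched in a few lines of Python. This is only a rough illustration, not part of the released files: the function names and the script name are made up for the example, and it assumes that to_iso6.pl writes plain ISO-8859-6 bytes to standard output.

```python
import sys


def iso6_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO-8859-6 bytes as UTF-8 (the same job as the iconv call)."""
    # "iso8859_6" is Python's codec name for ISO-8859-6 (Latin/Arabic).
    # Bytes that are undefined in that charset raise UnicodeDecodeError.
    return raw.decode("iso8859_6").encode("utf-8")


def convert_stream(src, dst) -> None:
    """Convert one line at a time, so the whole list is never held in memory."""
    for line in src:
        dst.write(iso6_to_utf8(line))


# Hypothetical use as a filter (script name is an assumption):
#   to_iso6.pl | python3 iso6_to_utf8.py > wordlist.utf8
# To run it that way, uncomment the next line:
# convert_stream(sys.stdin.buffer, sys.stdout.buffer)
```

Since ISO-8859-6 is a single-byte encoding, converting line by line is safe; no character can be split across two lines.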

