Nadav, I think the approach is good (based on my limited knowledge of Arabic), although I'm not an Arabic speaker, so I'm not really qualified to judge.
I've downloaded the file and have some questions about the format. The data seems to be stored in ASCII (I think), using Latin characters to represent Arabic characters. Please forgive my ignorance, but is there a standard mapping for this? Any reference? I notice that there is a Perl script that translates it to ISO format, but personally I prefer UTF-8 encoding for the data. Most "modern" programs can handle UTF-8.

One thing about the data that Mr. Sameer mentioned is the concern about the spelling of Holy Quran words. I think I can help a bit with this one: I have the complete Quran text, with all the marks, in text format.

Here is what I have in mind. I don't know about the aspell format, but I would like to add a few fields to the generated data. The final data will probably be stored in a database. The "must have" column for each word is its root. If the data is stored in a database, this column can simply be a reference to the row that holds the root. So when users want to look up a word, I can tell them its root straight away. I think this is a big help.

Why a database? Well, I think it's the easiest for lookup. As you mentioned before, the generated dataset is huge, and some applications have problems loading it into memory. If we store it in a database, we don't have to worry about that; the application just asks the database for what it needs. It is also easy to generate a new word list from the database to suit other applications' needs (aspell, dict format, etc.). The advantage, as I mentioned before, is that we don't have to worry about storage and memory management, plus we get the benefit of relational data, which makes cross-referencing easier.

As I said before, I don't have much experience in this, but this dataset seems to be the one I've been looking for. As for the Quran words, once the dataset is in the database, I can look them up and, if necessary, add the modifications. Or maybe what we really need is a separate dictionary for the Quran?

Regards.
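P.S. To make the database idea a bit more concrete, here is a minimal sketch of what I mean, assuming SQLite through Python's sqlite3 module. The table name `words`, the column names, and the helper functions are all hypothetical, nothing here comes from aspell or from the existing word list, and the sample words are only illustrative. Each row has a `root_id` column that references the row holding its root, and root rows leave it empty.

```python
import sqlite3

# Hypothetical schema: one row per word form, with root_id pointing at
# the row that holds the word's root (NULL for the root rows themselves).
SCHEMA = """
CREATE TABLE IF NOT EXISTS words (
    id      INTEGER PRIMARY KEY,
    form    TEXT NOT NULL UNIQUE,            -- the word, stored as UTF-8 text
    root_id INTEGER REFERENCES words(id)     -- reference to the root's row
);
"""


def open_lexicon(path=":memory:"):
    """Open (or create) the lexicon database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_word(conn, form, root=None):
    """Insert a word; if a root is given, insert it too and link to it."""
    root_id = None
    if root is not None:
        conn.execute("INSERT OR IGNORE INTO words (form) VALUES (?)", (root,))
        root_id = conn.execute(
            "SELECT id FROM words WHERE form = ?", (root,)
        ).fetchone()[0]
    # A form that is already present is left unchanged (simplification).
    conn.execute(
        "INSERT OR IGNORE INTO words (form, root_id) VALUES (?, ?)",
        (form, root_id),
    )


def root_of(conn, form):
    """Return the root of a word, or None if the word is unknown or is a root."""
    row = conn.execute(
        "SELECT r.form FROM words AS w "
        "JOIN words AS r ON r.id = w.root_id "
        "WHERE w.form = ?",
        (form,),
    ).fetchone()
    return row[0] if row else None


def export_word_list(conn):
    """Dump every stored form, e.g. as input for another tool's word list."""
    return [form for (form,) in conn.execute("SELECT form FROM words ORDER BY form")]


if __name__ == "__main__":
    conn = open_lexicon()
    add_word(conn, "كتاب", root="كتب")
    add_word(conn, "مكتبة", root="كتب")
    print(root_of(conn, "كتاب"))
    print(export_word_list(conn))
```

The application only asks the database for the rows it needs, so the whole list never has to sit in memory. The UNIQUE constraint on `form` is a simplification: a real lexicon would probably need to allow the same spelling under more than one root.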
On 5/17/06, Nadav Har'El <[EMAIL PROTECTED]> wrote:
On Tue, May 16, 2006, Mohammed Sameer wrote about "Re: A (too huge) Arabic word-list (with prefixes) for spell-checkers":
>...
> The data set contains words from the Holy Quran. The words in the Holy Quran are sometimes
> spelled in a different way due to the script used to write the Quran.
>
> Those words are incorrect outside the Quran context.
>...

I have looked at Dan's example on http://ivrix.org.il/projects/arabic/, and it seems that spell-checking a modern Arabic text (that he took from Wikipedia) worked quite well. Could it be that while that word list is not 100% correct, it still contains a substantial amount of correct data, and, say, 90% of the words it lists are spelled correctly, and most of the remaining words can easily be fixed by an Arabic writer?

The reason I'm asking this is because, like I said, 90% of the work that went into Hspell was building the lexicon. We spent a very large amount of time sifting through texts, looking for words flagged as spelling errors which are in fact correct words, to add to the lexicon. This effort became harder and harder as our lexicon grew, and I estimate that now it takes me 10 times the effort to find a new word to add than it took me when I was adding the first 1000 words. The reason we had to do this slow word-finding process was that it is illegal to just open a Hebrew dictionary and start copying the words one by one, so we had to find other ways to come up with missing words (we obviously couldn't just "recall" words from memory, and we had no free Hebrew lexicon).

With Tim Buckwalter's list, you have a much better start than we did: you can actually go over his list, word by word, and remove, or better yet fix, any misspelled word. It should be easier, I think, than starting from scratch.

Of course, you still need an inflection program in addition to the lexicon. If you think that Tim Buckwalter's inflection program creates wrong inflections, you can write a different one.
--
Nadav Har'El                        | Tuesday, May 16 2006, 19 Iyyar 5766
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Hindsight is always 20:20
http://nadav.harel.org.il           |
_______________________________________________
Developer mailing list
[email protected]
http://lists.arabeyes.org/mailman/listinfo/developer

