Since 'ar' is not known to the Moses tokenizer, I think it falls back to the default English tokenization scheme, which is what the warning in your output suggests. That is unlikely to be good for Arabic, but it's better than nothing.
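To give a rough idea of what language-agnostic tokenization amounts to, here is a minimal sketch in Python. It is not the Moses implementation (tokenizer.perl is a Perl script with its own rules); the function name `naive_tokenize` and the punctuation set are assumptions made purely for illustration. It only separates punctuation from words, including the Arabic comma, semicolon and question mark, and does none of the clitic segmentation that a morphological tool such as MADA is meant to provide.

```python
import re

# Punctuation to split off as separate tokens: common ASCII marks plus the
# Arabic comma (U+060C), semicolon (U+061B) and question mark (U+061F).
# This set is an illustrative assumption, not the list Moses uses.
_PUNCT = re.compile(r'([.,;:!?()\[\]"«»\u060C\u061B\u061F])')


def naive_tokenize(line: str) -> str:
    """Put spaces around punctuation, then collapse runs of whitespace."""
    spaced = _PUNCT.sub(r" \1 ", line)
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(naive_tokenize("مرحبا، كيف حالك؟"))
```

Note the limits of this kind of baseline: it would also split decimal numbers and abbreviations at the period, and it leaves attached Arabic prefixes and suffixes untouched, which is exactly where a language-specific tokenizer earns its keep.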
As you can tell, a good language-specific tokenizer, created by people who understand the language, is essential for good MT. If you only have a short time to do a project, the most useful thing you can do is write an Arabic tokenizer that is better or easier to use than MADA. I can help you integrate it into Moses if you want.

PS. MADA has its own mailing list:
https://lists.cs.columbia.edu/cucslists/listinfo/mada-users

On 8 July 2013 23:12, Heidi Heweidy <[email protected]> wrote:

> Hello,
> I am really anxious for help on setting up an Arabic-English Moses system.
> First, I installed the United Nations Arabic-English corpora found on:
> http://www.euromatrixplus.net/multi-un/
> Then I tried to tokenize the Arabic just as I did while following the
> Moses tutorial with the French-English corpora.
> I have a couple of questions:
> a. Since Moses doesn't have "ar" as a language, what can I do to solve
> this problem while tokenizing?
> The error is as follows:
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ar < ~/corpus/training/xml/ar/2009/S_PV6164-ar.xml > ~/corpus/S_PV6164-ar.tok.xml
> Tokenizer Version 1.1
> Language: ar
> Number of threads: 1
> WARNING: No known abbreviations for language 'ar', attempting fall-back to
> English version...
>
> b. Can anyone who has used MADA+TOKAN help me out, because it seems
> impossible for me to understand its tutorial:
> http://www1.ccls.columbia.edu/MADA/CCLS-12-01.pdf
>
> Thank you!
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu
