Since 'ar' is not known to the Moses tokenizer, I think it falls back to the
default English tokenization scheme, as the warning in your output suggests.
The result is unlikely to be good, but it's better than nothing.

As you can tell, a good language-specific tokenizer, created by people who
understand the language, is essential for good MT.

If you only have a short time to do a project, the most useful thing you
can do is build an Arabic tokenizer that is better/easier to use than MADA.
I can help you integrate it into Moses if you want.
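
To give a rough idea of a starting point, here is a minimal sketch in Python.
It only separates punctuation (including the Arabic comma, semicolon and
question mark) from words and strips the tatweel elongation character. The
function name tokenize_line and the punctuation set are my own assumptions for
illustration. This is not what MADA or tokenizer.perl does: there is no
morphological analysis or clitic splitting, and it will also split decimal
numbers such as 3.5.

```python
import re

# Punctuation to split off as separate tokens. The set is an illustrative
# assumption, not a complete list: Arabic comma (U+060C), Arabic semicolon
# (U+061B), Arabic question mark (U+061F), plus common ASCII punctuation.
PUNCT_RE = re.compile("([\u060C\u061B\u061F.,;:!?()\"«»\\[\\]])")

# Tatweel (kashida) is a purely typographic elongation character.
TATWEEL = "\u0640"


def tokenize_line(line):
    """Return the line with punctuation separated by single spaces."""
    line = line.replace(TATWEEL, "")
    # Surround every punctuation mark with spaces.
    line = PUNCT_RE.sub(r" \1 ", line)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(line.split())
```

Applied to each line of the corpus, this yields one space-separated sentence
per line, which is the same shape of output that tokenizer.perl produces.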


PS. MADA has its own mailing list:
   https://lists.cs.columbia.edu/cucslists/listinfo/mada-users


On 8 July 2013 23:12, Heidi Heweidy <[email protected]> wrote:

> Hello,
> I am really anxious for help on setting up an Arabic-English Moses system.
> First, I installed the United Nations Arabic-English corpora found at:
> http://www.euromatrixplus.net/multi-un/
> Then I tried to tokenize the Arabic just as I did while following the
> Moses tutorial with the French-English corpora.
> I have a couple of questions:
> a. Since Moses doesn't have "ar" as a language, what can I do to solve
> this problem while tokenizing?
> The error is as follows:
> ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l ar \
>   < ~/corpus/training/xml/ar/2009/S_PV6164-ar.xml \
>   > ~/corpus/S_PV6164-ar.tok.xml
> Tokenizer Version 1.1
> Language: ar
> Number of threads: 1
> WARNING: No known abbreviations for language 'ar', attempting fall-back to English version...
>
> b. Can anyone who has used MADA+TOKAN help me out? It seems
> impossible for me to understand its tutorial:
> http://www1.ccls.columbia.edu/MADA/CCLS-12-01.pdf
>
>
> Thank you!
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>


-- 
Hieu Hoang
Research Associate
University of Edinburgh
http://www.hoang.co.uk/hieu