Hi Emna

First, you will not get good results on Arabic with an English 
tokeniser. Try MADA, which does tokenisation and morphological 
segmentation for Arabic.

Secondly, you will need to extract the text from the XML before 
tokenising it and passing it to Moses; otherwise the tags get tokenised 
along with the text, as in your example. You may find something suitable 
in m4loc (http://code.google.com/p/m4loc/), but in general there are many 
tools for handling XML.
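
As a rough illustration of the extraction step, here is a minimal Python 
sketch using only the standard library. It assumes the sentences sit in 
<s> elements inside <p> elements, as in the snippet you quoted; the 
<text> root element in the sample and the function name 
extract_sentences are made up for the example, so adjust them to the 
real layout of the MultiUN files.

```python
import xml.etree.ElementTree as ET


def extract_sentences(xml_text):
    """Return the plain text of every <s> element, one string per sentence."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # itertext() also picks up text nested inside any child tags;
        # split/join collapses runs of whitespace and newlines.
        text = " ".join("".join(s.itertext()).split())
        # Empty sentences are kept so the sentence count is unchanged.
        sentences.append(text)
    return sentences


# Hypothetical sample: the <text> wrapper is an assumption, the <p>/<s>
# structure follows the snippet from the original question.
sample = """<text>
<p n="2">
<s n="2">Agenda item 116</s>
</p>
</text>"""

for line in extract_sentences(sample):
    print(line)
```

This should be run on the raw XML, before any tokenisation, and the 
resulting one-sentence-per-line text can then go to the tokeniser.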

cheers - Barry

On 24/10/14 02:30, emna hkiri wrote:
> Dear Friends
> I'm trying to build an ar-en system. I have downloaded the Arabic-English 
> // corpora from http://www.euromatrixplus.eu/multi-un/
> First, the Moses tokenizer does not support Arabic, so I tokenized 
> with the English one.
> The second problem is that the corpus is in XML format, so the English (and 
> Arabic) texts look like this after tokenization because of the 
> XML tags:
>
>
> < p n = " 2 " >
> < s n = " 2 " > Agenda item 116 < / s >
> < / p >
>
> So what should I do? Could you help me, please? I'm stuck at this point.
> Thank you for your help.
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

