Hi Emna,

First, you will not get good results on Arabic with an English tokeniser. Try MADA, which does tokenisation and morphological segmentation for Arabic.

Secondly, you will need to extract the text from the XML before passing it to Moses. You may find something suitable in m4loc (http://code.google.com/p/m4loc/), but in general there are many tools for handling XML.

cheers - Barry

On 24/10/14 02:30, emna hkiri wrote:
> Dear Friends
> i'm trying to build ar-en system. i have downloaded the arabic-english
> corpora from http://www.euromatrixplus.eu/multi-un/
> at first moses tokenizer do not include arabic language so i did it
> with english
> the second problem is that the corpus is in xml format.So english(also
> arabic)texts after the tokenization are in this format because of the
> tags of XML
>
> < p n = " 2 " >
> < s n = " 2 " > Agenda item 116 < / s >
> < / p >
>
> so what should i do??? would you help me please i'm stuck at this point
> thank you for your help
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
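On the XML point in the reply, here is a minimal sketch of pulling the sentence text out before tokenisation, using only the Python standard library. It assumes the sentences sit in `<s>` elements, as in the quoted snippet; the wrapper elements in the sample string and the function name `extract_sentences` are made up for illustration, so check them against the real corpus files.

```python
import xml.etree.ElementTree as ET


def extract_sentences(xml_text):
    """Return the text of every <s> element, one string per sentence."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # itertext() also collects text inside any nested inline tags;
        # split/join collapses runs of whitespace and newlines.
        text = " ".join("".join(s.itertext()).split())
        if text:
            sentences.append(text)
    return sentences


# Hypothetical sample modelled on the snippet in the quoted message.
sample = '<text><body><p n="2"><s n="2">Agenda item 116</s></p></body></text>'
for line in extract_sentences(sample):
    print(line)
```

This should be run on the original, untokenised XML, with the tokeniser applied to its output. The sketch skips empty sentences, so if the Arabic and English sides need to stay line-aligned, handle both sides the same way and check that the line counts match afterwards.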
