Abdelfetah Boumerdas <aa_boumerdas@...> writes:
> Hi all,
>
> I'm trying to build a translation model using Moses, and to do that I'm
> using two corpora (Europarl and the News Commentary corpus provided in the
> manual). When I reached the corpus preparation step, I noticed the
> following problem: in the prepared Europarl files, the apostrophe (') and
> the quotation mark (") are replaced with &apos; and &quot; respectively,
> but in the second corpus they are unchanged.
>
> Can anyone please tell me why? Is it a problem with the file encoding (I
> checked, and both are UTF-8), or is it another problem that I don't know
> about?
>
> Thanks in advance.

Hi Abdelfetah,

Some special characters (<, >, [, ], ", ', |) are reserved because they have
special meaning in the phrase table and/or are needed to support XML input.
The tokenizer.perl script automatically replaces them with escape sequences
(for example, ' becomes &apos; and " becomes &quot;), and the detokenizer
unescapes them again. There are also the scripts escape-special-chars.perl
and deescape-special-chars.perl to go from one form to the other without
(de)tokenizing.

Consistency (between corpora, and between training and test time) is
important. Is it possible that you used different versions of the
tokenizer.perl script? Older versions did not do escaping.

Best wishes,
Rico
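
To make the escaping concrete, here is a rough Python sketch of what the escape and de-escape steps amount to. It is not the Moses code itself: the entity table is an assumption modelled on the escaping step in recent versions of tokenizer.perl, and the function names (including the looks_escaped helper) are made up for illustration. It may still be handy for checking whether two prepared corpora were escaped the same way.

```python
# Assumed mapping, modelled on the escaping done by recent tokenizer.perl
# versions. "&" must be handled first so that the "&" introduced by the
# other entities is not escaped a second time.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),  # factor separator
    ("<", "&lt;"),    # XML input
    (">", "&gt;"),    # XML input
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),   # syntax non-terminal
    ("]", "&#93;"),   # syntax non-terminal
]


def escape_special_chars(line):
    """Replace reserved characters with their escape sequences."""
    for raw, esc in ESCAPES:
        line = line.replace(raw, esc)
    return line


def deescape_special_chars(line):
    """Undo escape_special_chars; "&amp;" is restored last."""
    for raw, esc in reversed(ESCAPES):
        line = line.replace(esc, raw)
    return line


def looks_escaped(line):
    """Hypothetical helper: True if the line contains any escape sequence."""
    return any(esc in line for _, esc in ESCAPES)


if __name__ == "__main__":
    sample = "l'homme dit \"oui\""
    escaped = escape_special_chars(sample)
    print(escaped)
    print(deescape_special_chars(escaped) == sample)
```

Running looks_escaped over a sample of lines from each prepared corpus should show quickly whether one of them went through an older tokenizer that did not escape.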