Hi,

The default Moses tokenizer escapes a number of special characters ( | < > [ ] ' " ) because they may otherwise interfere with the phrase table format ( | ), tree-based models ( [ ] ), and XML markup ( < > ' " ).
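As a rough illustration of this kind of escaping, here is a minimal Python sketch. It is not the Moses tokenizer itself: the function names are hypothetical, and the entity mapping (including the extra handling of `&`, which is not in the list above) is an assumption about what the tokenizer script emits.

```python
# Illustrative sketch only; not the actual Moses tokenizer code.
# The entity mapping below is an assumption about the tokenizer's output.
ESCAPES = [
    ("&", "&amp;"),    # must come first so later entities are not re-escaped
    ("|", "&#124;"),   # phrase table delimiter
    ("<", "&lt;"),     # XML markup
    (">", "&gt;"),     # XML markup
    ("'", "&apos;"),   # XML attribute quoting
    ('"', "&quot;"),   # XML attribute quoting
    ("[", "&#91;"),    # tree-based model brackets
    ("]", "&#93;"),    # tree-based model brackets
]


def escape_special_chars(text):
    """Replace each special character with its entity (hypothetical helper)."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_special_chars(text):
    """Undo escape_special_chars; '&amp;' is restored last (hypothetical helper)."""
    for raw, entity in reversed(ESCAPES):
        text = text.replace(entity, raw)
    return text
```

Restoring `&amp;` last in the reverse direction keeps a literal `&lt;` in the input from being turned into `<` on the round trip.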
After translating, the "detokenizer" puts it all back together into nice text.

-phi

On Mon, Jun 3, 2013 at 6:41 PM, Per Tunedal <[email protected]> wrote:
> Hi,
> I'm a bit confused about the input format for training Moses and for
> translating with Moses.
>
> After cleaning, tokenizing and truecasing I get text that looks like
> this:
>
> un projet d ' abaissement du taux limite pour la conduite sous l
> ' empire d ' un état alcoolique sera présenté au Riksdag .
>
> Is that what French should look like? (d'abaissement becomes d '
> abaissement)
>
> I suppose French test-sentences to translate should look the same,
> shouldn't they?
>
> Yours,
> Per Tunedal
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
