Hi,

The default Moses tokenizer escapes a number of characters ( | < > [ ]
' " ) because they may interfere with the phrase table format ( | ),
tree-based models ( [ ] ), and XML markup ( < > ' " ). The &apos; in
your output is simply the escaped apostrophe.

After translation, the detokenizer puts everything back together into
normal text.
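
To make the escaping concrete, here is a minimal Python sketch of the idea.
It is not the tokenizer's actual code: the function names are made up for
illustration, and the entity table is an assumption modelled on what the
tokenizer output looks like (including the ampersand, which has to be
escaped too so that the entities stay unambiguous). Check it against your
own copy of the tokenizer script before relying on it.

```python
# Assumed character-to-entity table (illustrative, not copied from Moses).
# "&" comes first so that the entities produced later are not escaped again.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),   # phrase table separator
    ("<", "&lt;"),     # XML markup
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),    # tree-based models
    ("]", "&#93;"),
]


def escape_special_chars(text):
    """Replace each special character with its entity (hypothetical helper)."""
    for char, entity in ESCAPES:
        text = text.replace(char, entity)
    return text


def unescape_special_chars(text):
    """Undo escape_special_chars; "&amp;" is restored last."""
    for char, entity in reversed(ESCAPES):
        text = text.replace(entity, char)
    return text


if __name__ == "__main__":
    print(escape_special_chars("d ' abaissement"))         # d &apos; abaissement
    print(unescape_special_chars("d &apos; abaissement"))  # d ' abaissement
```

The point of the sketch is only that the mapping is reversible, so nothing
is lost as long as training data and test input go through the same
tokenizer.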

-phi

On Mon, Jun 3, 2013 at 6:41 PM, Per Tunedal <[email protected]> wrote:
> Hi,
> I'm a bit confused about the input format for training Moses and for
> translating with Moses.
>
> After cleaning, tokenizing and truecasing I get text that looks like
> this:
>
> un projet d &apos; abaissement du taux limite pour la conduite sous l
> &apos; empire d &apos; un état alcoolique sera présenté au Riksdag .
>
> Is that what French should look like? (d'abaissement becomes d &apos;
> abaissement)
>
> I suppose French test-sentences to translate should look the same,
> shouldn't they?
>
> Yours,
> Per Tunedal
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
