Re: [Moses-support] phrase-table with ' " and other strange things. Additional corpus cleaning necessary?

Philipp Koehn Thu, 16 Apr 2020 14:54:21 -0700

Hi,

these items are introduced by the tokenizer. They are escape sequences for
characters that have special meaning in (some) Moses components.

They should show up in the phrase table, just as in your example. Any input
text that is pre-processed with the tokenizer will contain them, and any
output that is post-processed with the detokenizer will have the original
characters restored.
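
To illustrate the round trip, here is a minimal Python sketch of the escaping
step. It is not the Moses code itself: the function names are made up for this
example, and the table only mirrors the replacements that the tokenizer script
is understood to apply (the real scripts also handle a few legacy forms).

```python
# Illustrative re-implementation of the escape step; the names below are
# hypothetical and not part of Moses.
# "&" must be escaped first so that the "&" inside the other escape
# sequences is not escaped a second time.
MOSES_ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),   # "|" separates fields in the phrase table
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(text):
    """Replace special characters with their escape sequences."""
    for raw, escaped in MOSES_ESCAPES:
        text = text.replace(raw, escaped)
    return text


def unescape_special_chars(text):
    """Undo escape_special_chars; "&amp;" is handled last."""
    for raw, escaped in reversed(MOSES_ESCAPES):
        text = text.replace(escaped, raw)
    return text


if __name__ == "__main__":
    source = '" ( Das sind sehr'
    escaped = escape_special_chars(source)
    print(escaped)
    print(unescape_special_chars(escaped))
```

As long as training data, decoder input, and decoder output all pass through
the same escaping and unescaping steps, the sequences in the phrase table are
consistent and need no extra cleaning.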

-phi

On Sat, Apr 4, 2020 at 7:44 PM Artem Shevchenko <shev...@gmail.com> wrote:

> Hello,
>
> following the manual for baseline creation, I have trained the model
> using the Europarl v9 de-en pair.
> Now I observe that the resulting phrase table contains a lot of noise.
>
> E.g. a lot of "&apos; ", "&quot;" which seem to distort the model and
> decoder.
> E.g. truecasing did not work properly with those special symbols:
>
> &quot; ( Das sind sehr ||| &apos; ( these are very ||| 0.5 2.47962e-05
> 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||
>
> Did you do any additional purification of the corpus before training?
> Please share your experience.
>
> Artem Shevchenko
>
>

_______________________________________________
Moses-support mailing list
Moses-support@mit.edu
http://mailman.mit.edu/mailman/listinfo/moses-support
