Hello, thank you for your response. However, I'm not quite sure I understand it correctly.
My observation is that these special symbols are undesirable in the training corpus, because e.g. the truecaser and the decoder get confused by them and do not function properly. In the example I gave:

" ( Das sind sehr ||| ' ( these are very ||| 0.5 2.47962e-05 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||

it is not correct that "Das" in the German "Das sind sehr" is translated into the lowercase "these". The entry is also very specific because of the quotation marks, so such entries are just noise: they enlarge the phrase table without adding any value. It would be much better to have a phrase table without quotation marks, like:

das sind sehr ||| these are very ||| 0.5 2.47962e-05 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||

The quotation marks can be translated as well, like:

" ||| ' ||| ......

So either the tokenizer does not work well, or one needs additional purification steps afterwards to produce a clean corpus for training. Right?

Regards
Artem Shevchenko

Thu, 16 Apr 2020 at 23:47, Philipp Koehn <[email protected]>:

> Hi,
>
> these items are introduced by the tokenizer - they are used to escape
> characters that have special meaning in (some) Moses components.
>
> They should show up in the phrase table, as you show them. Any input text
> that is pre-processed with the tokenizer will have them, and any output
> that is post-processed with the detokenizer will have them restored.
>
> -phi
>
> On Sat, Apr 4, 2020 at 7:44 PM Artem Shevchenko <[email protected]> wrote:
>
>> Hello,
>>
>> following the manual for baseline creation, I have trained the model
>> using the Europarl v9 de-en pair.
>> Now I observe that the resulting phrase table contains a lot of noise.
>>
>> E.g. a lot of "' ", """ which seem to distort the model and decoder.
>> E.g. truecasing did not work properly with those special symbols:
>>
>> " ( Das sind sehr ||| ' ( these are very ||| 0.5 2.47962e-05
>> 0.333333 7.4064e-05 ||| 0-0 1-1 2-2 3-3 4-4 ||| 2 3 1 ||| |||
>>
>> Did you do any additional purification of the corpus before training?
>> Please share your experience.
>>
>> Artem Shevchenko
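
For reference, the escaping that Philipp describes can be sketched roughly as below. This is only an illustration in Python, not the tokenizer's actual code: the mapping is an assumption modelled on the XML-style entities (such as `&quot;` and `&apos;`) that the Moses tokenizer is commonly seen to emit, and the function names `escape` and `unescape` are hypothetical. Check the tokenizer and detokenizer scripts of your own Moses version before relying on it.

```python
# Sketch only: an assumed escape table modelled on the entities the Moses
# tokenizer typically produces. Verify against your tokenizer version.
MOSES_ESCAPES = [
    ("&", "&amp;"),    # must be escaped first, so later entities stay intact
    ("|", "&#124;"),   # "|" separates factors and phrase-table fields
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),    # brackets mark non-terminals in syntax models
    ("]", "&#93;"),
]


def escape(text):
    """Replace special characters with entities (tokenizer side)."""
    for raw, entity in MOSES_ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape(text):
    """Restore the original characters (detokenizer side)."""
    # Walk the table backwards so that "&amp;" is restored last.
    for raw, entity in reversed(MOSES_ESCAPES):
        text = text.replace(entity, raw)
    return text


if __name__ == "__main__":
    escaped = escape('"Das sind sehr"')
    print(escaped)
    print(unescape(escaped))
```

Under this assumption the entities are ordinary tokens during training, which is why they appear in the phrase table, and the round trip through `unescape` restores the original characters in the final output.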
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
