Hi,

The script expects tokenized data, and word alignment will fail if
sentences are too long or if the lengths in a sentence pair are badly
mismatched (e.g., a 1-word sentence translated as a 70-word sentence). The
cleaning script removes such sentence pairs. It also removes spurious
spaces, which may throw some processing steps off. Also, the provided
tokenizer deals with special characters like "|". If you do not use this
tokenizer, you should run scripts/tokenizer/escape-special-chars.perl to
escape them.
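
To make the cleaning step concrete, here is a minimal Python sketch of the
kind of filtering described above. It is not the Moses cleaning script
itself; the function names and the thresholds (max_len=80, max_ratio=9.0)
are illustrative assumptions, and it does not escape special characters.

```python
def clean_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return the whitespace-normalized pair, or None if it should be dropped.

    The limits are illustrative assumptions, not values taken from Moses.
    """
    # split() with no arguments collapses runs of whitespace and trims the ends
    src_tokens = src.split()
    tgt_tokens = tgt.split()
    n_src, n_tgt = len(src_tokens), len(tgt_tokens)

    # Empty sides cannot be aligned (and would break the ratio check below)
    if n_src == 0 or n_tgt == 0:
        return None
    # Drop pairs where either side is too short or too long
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return None
    # Drop pairs with a large length mismatch (e.g. 1 word vs. 70 words)
    if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
        return None
    return " ".join(src_tokens), " ".join(tgt_tokens)


def clean_corpus(pairs, **limits):
    """Yield only the sentence pairs that survive clean_pair()."""
    for src, tgt in pairs:
        cleaned = clean_pair(src, tgt, **limits)
        if cleaned is not None:
            yield cleaned
```

Both sides of a pair are kept or dropped together, so the two files stay
parallel line by line.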

Truecasing is optional. Many people lowercase instead.

It does not matter to the training script how you prepare the data, so you
do not have to explicitly run these steps. If your data is already
tokenized, for example, there is no need to run the tokenizer.

Whatever you specify with "-corpus" (full path!) should work, as long as
the issues spelled out in the first paragraph above are addressed.
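
A quick sanity check before training can catch the most common path
problems. The following is a minimal Python sketch, assuming the corpus is
given as a stem plus language extensions (e.g. /path/corpus.fr and
/path/corpus.en, as in the baseline instructions); check_corpus is a
hypothetical helper, not part of Moses.

```python
import os


def check_corpus(stem, src_ext, tgt_ext):
    """Return a list of problems found; an empty list means the checks passed.

    Assumes the two sides of the corpus live in stem.src_ext and stem.tgt_ext.
    """
    problems = []
    if not os.path.isabs(stem):
        problems.append("corpus stem is not a full path: " + stem)

    counts = []
    for ext in (src_ext, tgt_ext):
        path = stem + "." + ext
        if not os.path.isfile(path):
            problems.append("missing file: " + path)
            continue
        # Count lines; parallel files must have one sentence per line each
        with open(path, encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))

    if len(counts) == 2 and counts[0] != counts[1]:
        problems.append("line counts differ: %d vs %d" % tuple(counts))
    return problems
```

For example, check_corpus("/home/me/corpus/train.clean", "fr", "en") would
return an empty list if both files exist and have the same number of lines
(the path here is made up for illustration).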

-phi

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <[email protected]> wrote:

> In the past I've never been able to get the training script to run to
> completion without rigorously following the instructions here
> http://www.statmt.org/moses/?n=moses.baseline
>
>
> 1) Tokenise
>
> 2) Train truecaser
>
> 3) Truecase
>
> 4) Clean
>
>
> What if somebody wants to just tokenize and clean without truecasing or
> just clean without tokenizing? Why should the script bomb out? Is this
> something to do with formats required by early stages of the training
> process?
>
>
> James
>
>
> NOTE: This is not an open invitation to discuss why somebody would want to
> train models without tokenizing or truecasing. This is nothing more than a
> request for technical assistance.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
