Hi,

the script expects tokenized data, and word alignment will fail if sentences are too long or if the two sides of a sentence pair differ greatly in length (e.g., a 1-word sentence translated as a 70-word sentence). Filtering out such pairs is what the cleaning script does. It also removes spurious spaces, which could otherwise throw off some processing steps. In addition, the provided tokenizer deals with special characters like "|". If you do not use this tokenizer, you should run scripts/tokenizer/escape-special-chars.perl to escape them.
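To make the cleaning criteria concrete, here is a minimal Python sketch of the kind of filtering described above. It is not the Moses cleaning script itself: the function name `clean_pair` and the default thresholds (`min_len`, `max_len`, `max_ratio`) are assumptions chosen for illustration, so check the actual script for the values it uses.

```python
def clean_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return the whitespace-normalized pair, or None if it should be dropped.

    Illustrative only: the thresholds are assumed values, not necessarily
    the ones used by the Moses cleaning script.
    """
    # Splitting on whitespace and re-joining removes spurious spaces.
    src_tokens = src.split()
    tgt_tokens = tgt.split()

    # Drop pairs where either side is empty, too short, or too long.
    for tokens in (src_tokens, tgt_tokens):
        if not (min_len <= len(tokens) <= max_len):
            return None

    # Drop pairs whose lengths are badly mismatched.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    if longer > max_ratio * shorter:
        return None

    return " ".join(src_tokens), " ".join(tgt_tokens)
```

With these assumed defaults, a 1-word sentence paired with a 70-word sentence is dropped, while an ordinary pair is kept with its extra spaces collapsed.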
Truecasing is optional. Many people lowercase instead. It does not matter to the training script how you prepare the data, so you do not have to explicitly run these steps. You may already have tokenized data, in which case there is no need to run the tokenizer. Whatever you specify with "-corpus" (full path!) should work, as long as the issues spelled out in the first paragraph above are addressed.

-phi

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <[email protected]> wrote:

> In the past I've never been able to get the training script to run to
> completion without rigorously following the instructions here
> http://www.statmt.org/moses/?n=moses.baseline
>
> 1) Tokenise
> 2) Train truecaser
> 3) Truecase
> 4) Clean
>
> What if somebody wants to just tokenize and clean without truecasing or
> just clean without tokenizing? Why should the script bomb out? Is this
> something to do with formats required by early stages of the training
> process?
>
> James
>
> NOTE: This is not an open invitation to discuss why somebody would want to
> train models without tokenizing or truecasing. This is nothing more than a
> request for technical assistance.
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
