If I just clean the corpus and escape special characters, would that be the minimum required to get the training script to complete?

James

________________________________
From: [email protected] <[email protected]> on behalf of Philipp Koehn <[email protected]>
Sent: Wednesday, December 2, 2015 6:31 PM
To: Read, James C
Cc: Moses Support
Subject: Re: [Moses-support] Training script documentation

Hi,

the script expects tokenized data, and word alignment will fail if there are too long sentences or if there is length mismatch in a sentence pair (e.g., 1 word sentence translated as 70 word sentence). That's what the cleaning script does. It also removes spurious spaces, which may throw some processing steps off. Also, the provided tokenizer deals with special characters like "|". If you do not use this tokenizer, you should run scripts/tokenizer/escape-special-chars.perl to escape them.

Truecasing is optional. Many do lowercasing.

It does not matter to the training script how you prepare the data, so you do not have to explicitly run these steps. You may already have tokenized data, so no need to run the tokenizer. Whatever you specify with "-corpus" (full path!) should work, as long as the issues spelled out in the first paragraph above are addressed.

-phi

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <[email protected]> wrote:

In the past I've never been able to get the training script to run to completion without rigorously following the instructions here http://www.statmt.org/moses/?n=moses.baseline

1) Tokenise
2) Train truecaser
3) Truecase
4) Clean

What if somebody wants to just tokenize and clean without truecasing or just clean without tokenizing? Why should the script bomb out? Is this something to do with formats required by early stages of the training process?

James

NOTE: This is not an open invitation to discuss why somebody would want to train models without tokenizing or truecasing. This is nothing more than a request for technical assistance.
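
For illustration, the minimum preparation described in the reply (drop sentence pairs that are empty, too long, or badly mismatched in length; collapse spurious spaces; escape special characters) can be sketched in Python as below. This is not the Moses cleaning or escaping script. The function names, the length limits, and the ratio limit are hypothetical values chosen for the example, and the escape table is only an approximation of what scripts/tokenizer/escape-special-chars.perl is understood to do, so check the script itself before relying on it.

```python
# Hypothetical limits, for illustration only; pick values that suit your corpus.
MIN_LEN = 1      # minimum tokens per sentence
MAX_LEN = 80     # maximum tokens per sentence
MAX_RATIO = 9.0  # maximum length ratio between the two sides of a pair

# Assumed escape table, approximating escape-special-chars.perl.
# "&" must come first so the entities added afterwards are not escaped again.
ESCAPES = [
    ("&", "&amp;"), ("|", "&#124;"), ("<", "&lt;"), (">", "&gt;"),
    ("'", "&apos;"), ('"', "&quot;"), ("[", "&#91;"), ("]", "&#93;"),
]


def normalize_spaces(line):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return " ".join(line.split())


def escape_special(line):
    """Replace characters that have special meaning in Moses' formats."""
    for raw, escaped in ESCAPES:
        line = line.replace(raw, escaped)
    return line


def keep_pair(src, tgt):
    """Return True if the pair passes the length and ratio checks."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if not (MIN_LEN <= n_src <= MAX_LEN and MIN_LEN <= n_tgt <= MAX_LEN):
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= MAX_RATIO


def clean_pairs(pairs):
    """Yield normalized, escaped sentence pairs that pass the checks."""
    for src, tgt in pairs:
        src, tgt = normalize_spaces(src), normalize_spaces(tgt)
        if keep_pair(src, tgt):
            yield escape_special(src), escape_special(tgt)
```

Note that this sketch escapes every "&", so running it twice on the same text would double-escape it. Both sides of the corpus must be filtered together, as here, so that the line counts stay aligned.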

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
