Hi, yes, those two steps are the only things required to avoid crashes.
-phi

On Thu, Dec 3, 2015 at 12:01 PM, Read, James C <[email protected]> wrote:
> If I just clean and escape-special-characters would that be the minimum
> requirement to get the training script to complete?
>
> James
>
> ________________________________
> From: [email protected] <[email protected]> on behalf of Philipp Koehn
> <[email protected]>
> Sent: Wednesday, December 2, 2015 6:31 PM
> To: Read, James C
> Cc: Moses Support
> Subject: Re: [Moses-support] Training script documentation
>
> Hi,
>
> the script expects tokenized data, and word alignment will fail if there are
> sentences that are too long or if there is a length mismatch in a sentence
> pair (e.g., a 1-word sentence translated as a 70-word sentence). That's what
> the cleaning script takes care of. It also removes spurious spaces, which may
> throw some processing steps off. Also, the provided tokenizer deals with
> special characters like "|". If you do not use this tokenizer, you should run
> scripts/tokenizer/escape-special-chars.perl to escape them.
>
> Truecasing is optional. Many do lowercasing.
>
> It does not matter to the training script how you prepare the data, so you
> do not have to explicitly run these steps. You may already have tokenized
> data, so no need to run the tokenizer.
>
> Whatever you specify with "-corpus" (full path!) should work, as long as the
> issues spelled out in the first paragraph above are addressed.
>
> -phi
>
> On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <[email protected]> wrote:
>>
>> In the past I've never been able to get the training script to run to
>> completion without rigorously following the instructions here
>> http://www.statmt.org/moses/?n=moses.baseline
>>
>> 1) Tokenise
>> 2) Train truecaser
>> 3) Truecase
>> 4) Clean
>>
>> What if somebody wants to just tokenize and clean without truecasing or
>> just clean without tokenizing? Why should the script bomb out? Is this
>> something to do with formats required by early stages of the training
>> process?
>>
>> James
>>
>> NOTE: This is not an open invitation to discuss why somebody would want to
>> train models without tokenizing or truecasing. This is nothing more than a
>> request for technical assistance.
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
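
For readers who want to see what the two required steps amount to, here is a
minimal Python sketch of the idea described in the thread: collapse spurious
spaces, drop sentence pairs that are empty, too long, or badly mismatched in
length, and escape characters such as "|". It is not the Moses cleaning or
escaping script. The function names, the escape table, and the thresholds
(max_len=80, max_ratio=9.0) are illustrative assumptions, so check the actual
scripts for their exact behavior.

```python
# Hypothetical sketch only: names, thresholds and the escape table are
# assumptions for illustration, not the behavior of the Moses scripts.

# "&" must be handled first so the "&" inside later entities is not re-escaped.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def normalize_spaces(line):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return " ".join(line.split())


def escape_special(line):
    """Replace characters that carry special meaning with entities."""
    for raw, entity in ESCAPES:
        line = line.replace(raw, entity)
    return line


def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if the pair is non-empty, short enough, and not badly
    mismatched in length (token counts are whitespace-based)."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if n_src == 0 or n_tgt == 0:
        return False
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def prepare(pairs, **limits):
    """Clean, filter, and escape a list of (source, target) sentence pairs."""
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize_spaces(src), normalize_spaces(tgt)
        if keep_pair(src, tgt, **limits):
            kept.append((escape_special(src), escape_special(tgt)))
    return kept
```

Filtering happens before escaping so that the length checks count the original
tokens. A pair such as a 1-word sentence aligned with a 70-word sentence is
dropped by the ratio check, which is the kind of mismatch the thread says
breaks word alignment.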
