If I just run the cleaning script and escape-special-chars.perl, would that be 
the minimum requirement to get the training script to complete?


James


________________________________
From: [email protected] <[email protected]> on behalf of Philipp Koehn 
<[email protected]>
Sent: Wednesday, December 2, 2015 6:31 PM
To: Read, James C
Cc: Moses Support
Subject: Re: [Moses-support] Training script documentation

Hi,

the script expects tokenized data, and word alignment will fail if there are 
sentences that are too long or if there is a length mismatch in a sentence 
pair (e.g., a 1-word sentence translated as a 70-word sentence). The cleaning 
script removes such sentence pairs. It also removes spurious spaces, which may 
throw some processing steps off. Also, the provided tokenizer deals with 
special characters like "|". If you do not use this tokenizer, you should run 
scripts/tokenizer/escape-special-chars.perl to escape them.
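
As a rough illustration of the checks described above, the following is a 
minimal Python sketch, not the Moses scripts themselves. The function names, 
the length limit of 80, and the ratio limit of 9 are illustrative assumptions, 
and the escape table is assumed to mirror what escape-special-chars.perl does; 
check the script in your own Moses checkout before relying on it.

```python
# Hypothetical sketch of corpus cleaning and escaping; not the Moses code.

# Assumed replacement table. "&" comes first so that the entities produced
# by the later rules are not escaped a second time.
ESCAPES = [
    ("&", "&amp;"),
    ("|", "&#124;"),  # "|" is the factor separator in Moses
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def normalize_spaces(line):
    """Collapse runs of whitespace and strip leading/trailing spaces."""
    return " ".join(line.split())


def escape_special_chars(line):
    """Replace characters that have a special meaning in the pipeline."""
    for raw, escaped in ESCAPES:
        line = line.replace(raw, escaped)
    return line


def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if the pair passes the length and ratio checks."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return False  # empty side: nothing to align
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False  # sentence too short or too long
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def clean_corpus(pairs, **limits):
    """Yield cleaned, escaped (source, target) pairs that pass the checks."""
    for src, tgt in pairs:
        src, tgt = normalize_spaces(src), normalize_spaces(tgt)
        if keep_pair(src, tgt, **limits):
            yield escape_special_chars(src), escape_special_chars(tgt)
```

With these assumed limits, a 1-word sentence paired with a 70-word sentence 
is dropped because the length ratio exceeds 9, and a line such as "a | b" is 
written out as "a &#124; b".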

Truecasing is optional. Many people lowercase instead.

It does not matter to the training script how you prepare the data, so you do 
not have to explicitly run these steps. You may already have tokenized data, so 
no need to run the tokenizer.

Whatever you specify with "-corpus" (full path!) should work, as long as the 
issues spelled out in the first paragraph above are addressed.

-phi

On Wed, Dec 2, 2015 at 10:28 AM, Read, James C 
<[email protected]> wrote:

In the past I've never been able to get the training script to run to 
completion without rigorously following the instructions here 
http://www.statmt.org/moses/?n=moses.baseline



1) Tokenise

2) Train truecaser

3) Truecase

4) Clean


What if somebody wants to just tokenize and clean without truecasing or just 
clean without tokenizing? Why should the script bomb out? Is this something to 
do with formats required by early stages of the training process?


James


NOTE: This is not an open invitation to discuss why somebody would want to 
train models without tokenizing or truecasing. This is nothing more than a 
request for technical assistance.

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

