Hi,

Yes, those are the only two things required to avoid crashes.
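
For illustration only, here is a minimal Python sketch of what those two steps amount to. It is not the Moses scripts themselves (clean-corpus-n.perl and scripts/tokenizer/escape-special-chars.perl remain the reference). The function names, the length limits, the ratio threshold and the replacement table are all assumptions made for this example.

```python
import re

# Assumed replacement table, modelled on what escape-special-chars.perl
# is understood to do. Check it against the actual script before relying on it.
_ESCAPES = [
    ("&", "&amp;"),   # must come first so later entities are not re-escaped
    ("|", "&#124;"),  # factor separator in Moses
    ("<", "&lt;"),
    (">", "&gt;"),
    ("'", "&apos;"),
    ('"', "&quot;"),
    ("[", "&#91;"),
    ("]", "&#93;"),
]


def escape_special_chars(line):
    """Replace characters that have a special meaning to Moses with entities."""
    for raw, entity in _ESCAPES:
        line = line.replace(raw, entity)
    return line


def normalize_spaces(line):
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()


def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes the length checks.

    min_len, max_len and max_ratio are hypothetical defaults for this sketch.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    if min(n_src, n_tgt) == 0:
        return False  # empty side: nothing to align
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False  # too short or too long
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def clean_and_escape(pairs, **limits):
    """Yield cleaned, escaped (source, target) pairs; drop the ones that fail."""
    for src, tgt in pairs:
        src, tgt = normalize_spaces(src), normalize_spaces(tgt)
        if keep_pair(src, tgt, **limits):
            yield escape_special_chars(src), escape_special_chars(tgt)
```

Feeding it an iterable of (source, target) line pairs read from the two sides of the corpus drops empty, overlong and badly mismatched pairs and escapes the rest. The input is assumed to be tokenized already.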

-phi

On Thu, Dec 3, 2015 at 12:01 PM, Read, James C <[email protected]> wrote:
> If I just clean and escape-special-characters would that be the minimum
> requirement to get the training script to complete?
>
>
> James
>
>
>
> ________________________________
> From: [email protected] <[email protected]> on behalf of Philipp Koehn
> <[email protected]>
> Sent: Wednesday, December 2, 2015 6:31 PM
> To: Read, James C
> Cc: Moses Support
> Subject: Re: [Moses-support] Training script documentation
>
> Hi,
>
> the script expects tokenized data, and word alignment will fail if sentences
> are too long or if there is a length mismatch in a sentence pair (e.g., a
> 1-word sentence translated as a 70-word sentence). The cleaning script
> removes such pairs. It also removes spurious spaces, which may throw some
> processing steps off. Also, the provided tokenizer deals with special
> characters like "|". If you do not use this tokenizer, you should run
> scripts/tokenizer/escape-special-chars.perl to escape them.
>
> Truecasing is optional. Many people lowercase instead.
>
> It does not matter to the training script how you prepare the data, so you
> do not have to explicitly run these steps. You may already have tokenized
> data, so no need to run the tokenizer.
>
> Whatever you specify with "-corpus" (full path!) should work, as long as the
> issues spelled out in the first paragraph above are addressed.
>
> -phi
>
> On Wed, Dec 2, 2015 at 10:28 AM, Read, James C <[email protected]> wrote:
>>
>> In the past I've never been able to get the training script to run to
>> completion without rigorously following the instructions here
>> http://www.statmt.org/moses/?n=moses.baseline
>>
>>
>>
>> 1) Tokenise
>>
>> 2) Train truecaser
>>
>> 3) Truecase
>>
>> 4) Clean
>>
>>
>> What if somebody wants to just tokenize and clean without truecasing or
>> just clean without tokenizing? Why should the script bomb out? Is this
>> something to do with formats required by early stages of the training
>> process?
>>
>>
>> James
>>
>>
>> NOTE: This is not an open invitation to discuss why somebody would want to
>> train models without tokenizing or truecasing. This is nothing more than a
>> request for technical assistance.
>>
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>