It depends on the size of your parallel training corpus. With roughly 
200K segments or more, common practice is to hold out dev (tuning) and 
test sets of 2,000 - 2,500 segments each, rather than a fixed 
percentage of the corpus. Another approach is to calculate how many 
randomly drawn segments would make a statistically representative 
sample given the total size of your corpus, much as you would decide 
how many people to poll for a political or marketing survey.
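
To make both ideas concrete, here is a rough Python sketch. It is not 
anything Moses ships: the function names, the default sizes and the 
seed are my own placeholders. The sample-size function uses the 
standard survey formula (Cochran's, with a finite population 
correction), which is only an analogy for MT evaluation, so treat its 
output as a sanity check rather than a rule.

```python
import math
import random


def survey_sample_size(population, margin=0.02, z=1.96, p=0.5):
    """Survey-style sample size (Cochran's formula with finite
    population correction).

    population: total number of segments in the corpus
    margin:     acceptable margin of error (0.02 = +/- 2%), an assumption
    z:          z-score for the confidence level (1.96 is about 95%)
    p:          assumed proportion; 0.5 is the most conservative choice
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def split_corpus(pairs, dev_size=2000, test_size=2000, seed=1234):
    """Randomly hold out fixed-size dev and test sets.

    pairs: list of (source, target) segment pairs, already aligned.
    Returns (train, dev, test) with no segment pair shared between them.
    """
    if dev_size + test_size >= len(pairs):
        raise ValueError("corpus too small for the requested held-out sets")
    shuffled = list(pairs)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test
```

With the defaults above, survey_sample_size(200000) lands in the same 
2,000 - 2,500 range mentioned earlier. Shuffling source and target 
together as pairs keeps the two sides aligned, and note that duplicate 
segments in the corpus can still end up on both sides of the split 
unless you deduplicate first.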


On 02/15/2014 10:07 PM, Julian Myerscough wrote:
> Hi folks,
>
> Are there any rule of thumb proportions for the amount of text held out
> for tuning and training?
>
> eg 80% train 10% tune 10% test
>
> Thanks.
>
> Julian
>
>
> -------------------------------
>
> Julian Myerscough
> Quality Assurance Manager - Languages for Business Ltd
>
> Languages for Business Ltd
> PO Box 5194, Cardiff CF5 9DZ UK
> Tel: +44 (0)29 2044 4400  Fax: +44 (0)29 2044 4401
> [email protected] www.LfBtranslations.co.uk
>
> Office hours:
> 9:00 - 17:00 UTC/GMT   4:00 - 12:00 EST
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support

