It depends on the size of your parallel training corpus. I think that with ~200K segments or more, the general practice is to create dev (tuning) and test sets of 2,000 to 2,500 segments each. Another way would be to calculate how many randomly drawn segments would make a statistically representative sample given the total size of your corpus, much like deciding how many randomly selected people to poll in a political or marketing survey.
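The survey-style calculation can be sketched roughly as below. This is only an illustration under stated assumptions: it uses the standard survey sample-size formula (Cochran's formula with a finite population correction), which is one common choice and not necessarily the right statistic for MT evaluation. The function names `survey_sample_size` and `split_corpus`, and the default margin and confidence values, are hypothetical and not part of Moses.

```python
import math
import random


def survey_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Sample size from Cochran's formula with finite population correction.

    Assumptions (not from the original post): z=1.96 corresponds to ~95%
    confidence, and p=0.5 is the most conservative proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def split_corpus(pairs, dev_size, test_size, seed=0):
    """Randomly hold out dev and test segments; the rest is training data.

    `pairs` is a sequence of (source, target) segment pairs. Shuffling the
    pairs together keeps source and target aligned.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


if __name__ == "__main__":
    corpus_size = 200_000
    n = survey_sample_size(corpus_size, margin=0.02)
    print(f"held-out segments per set: {n}")
```

Under these assumptions, a 200K-segment corpus with a 2% margin of error gives 2,373 segments, which happens to fall in the 2,000 to 2,500 range mentioned above. A looser 5% margin gives only 384, so the result is quite sensitive to the margin you pick.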
On 02/15/2014 10:07 PM, Julian Myerscough wrote:
> Hi folks,
>
> Are there any rule of thumb proportions for the amount of text held out
> for tuning and training?
>
> eg 80% train 10% tune 10% test
>
> Thanks.
>
> Julian
>
> -------------------------------
> Julian Myerscough
> Quality Assurance Manager - Languages for Business Ltd

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
