We use a random sample size calculation to determine the optimal sample size based on the size of each bitext corpus: http://en.wikipedia.org/wiki/Sample_size_determination. Interestingly, Wikipedia's introduction states, "The sample size is an important feature of any empirical study in which the goal is to make inferences about a population from a sample."

As it turns out, for most corpora we encounter, the tuning set sizes fall somewhere in the middle of the range Philipp suggested, i.e. 2-3K lines.
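
For anyone who wants to try this, here is a minimal sketch of one common sample size calculation: Cochran's formula with a finite population correction. The formula choice, the function name sample_size, and the parameter values (95% confidence, i.e. z = 1.96, a 2% margin of error, and p = 0.5) are assumptions for illustration only, not necessarily the exact settings we use.

```python
import math


def sample_size(population, z=1.96, margin=0.02, p=0.5):
    """Estimate a sample size for a corpus of `population` lines.

    Uses Cochran's formula with a finite population correction.
    The defaults (z, margin, p) are illustrative assumptions.
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    # Finite population correction shrinks n0 for smaller corpora.
    n = n0 / (1 + (n0 - 1) / population)
    # Never ask for more lines than the corpus contains.
    return min(population, math.ceil(n))


if __name__ == "__main__":
    for lines in (10_000, 100_000, 600_000, 5_000_000):
        print(lines, sample_size(lines))
```

With these assumed settings, the result levels off at roughly 2.4K lines once the corpus grows past a few hundred thousand lines, so the tuning set size depends only weakly on corpus size. A tighter margin of error would push the number up.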

Tom


On 10/05/2014 04:22 PM, Roee Aharoni wrote:
Hi,
In a recent post it was mentioned that "600k line tuning set is way too big. It will take forever. It's better to reduce it to 2-3k lines." Is there a reference to an empirical experiment searching for an "optimal" MERT tune set size?

Thanks,



_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
