Looks like I am using the right method, thank you Philipp.
2010/8/9, Philipp Koehn <[email protected]>:
> Hi,
>
> scores will differ for different test sets. If you randomly sample the
> test set from your parallel data, they should be relatively similar,
> though.
>
> You can compute confidence intervals with bootstrap resampling, which
> will give you some indication of how reliable a, say, 0.5, 1.0, or 2.0
> point difference in BLEU is.
>
> Regarding test set sizes, it should be at least 1,000 sentence pairs;
> 12,000 is certainly very large.
>
> -phi
>
> On Sat, Aug 7, 2010 at 11:40 AM, Wenlong Yang <[email protected]> wrote:
>> Hi all,
>>
>> Can any of you help to provide some material about how to select the
>> sample data for BLEU/NIST evaluation?
>> I mean, how many lines of data should I choose for the evaluation? And
>> how can I choose the data so that it is more representative of our
>> domain/use?
>>
>> I have tried to generate BLEU scores using 1,000 lines of sample data
>> and 12,000 lines of data, both of which are in our domain, but the
>> second evaluation has higher scores. Does this make sense?
>> I actually trained two Moses engines. For the first evaluation (1,000
>> lines), Moses Engine 1's score is lower than Moses Engine 2's; but for
>> the second (12,000 lines), Moses Engine 1's score is higher than Moses
>> Engine 2's.
>> Which result should I trust? This phenomenon makes me trust the scores
>> less.
>>
>> Does anybody have any similar experience? Is there any problem with my
>> evaluation data?
>> How can I generate more accurate scores?
>>
>> Thanks so much,
>> Wenlong
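
For reference, here is a minimal sketch of the bootstrap resampling Philipp
mentions. It is not the Moses tooling itself; it assumes the sacrebleu
Python package for scoring, and the helper name bootstrap_bleu_ci is made
up for illustration. The idea: resample the test set's sentence pairs with
replacement many times, recompute corpus BLEU on each resample, and read
the confidence interval off the percentiles of the resulting scores.

    import random
    import sacrebleu  # assumption: using sacrebleu for corpus BLEU

    def bootstrap_bleu_ci(hyps, refs, n_samples=1000,
                          confidence=0.95, seed=42):
        """Estimate a confidence interval for corpus BLEU by
        bootstrap resampling.

        hyps: list of system output sentences (str)
        refs: list of reference sentences (str), aligned with hyps
        """
        rng = random.Random(seed)
        n = len(hyps)
        scores = []
        for _ in range(n_samples):
            # Resample sentence indices with replacement
            idx = [rng.randrange(n) for _ in range(n)]
            sample_hyps = [hyps[i] for i in idx]
            sample_refs = [refs[i] for i in idx]
            scores.append(
                sacrebleu.corpus_bleu(sample_hyps, [sample_refs]).score)
        scores.sort()
        # Percentile interval, e.g. the 2.5th and 97.5th
        # percentiles for a 95% interval
        lo = scores[int(((1 - confidence) / 2) * n_samples)]
        hi = scores[int((1 - (1 - confidence) / 2) * n_samples) - 1]
        return lo, hi

If the intervals for Engine 1 and Engine 2 overlap heavily on a
1,000-sentence test set, the flip between the two engines across the two
test sets is exactly the kind of noise the interval is meant to expose.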
