Hi all, can anybody help with this?
There may have been some mistakes during the training of Moses engine 2 (it used the source-side language model), but the criterion for selecting the sample data for BLEU/NIST score evaluation is still something I want to understand and confirm.

Thanks,
Wenlong

2010/8/7, Wenlong Yang <[email protected]>:
> Hi all,
>
> Can any of you provide some material about how to select the sample
> data for BLEU/NIST evaluation?
> I mean, how many lines of data should I choose for the evaluation, and
> how can I choose the data so that it is more representative of our
> domain/use?
>
> I have tried to generate BLEU scores using a 1000-line sample and a
> 12000-line sample, both from our domain, and the second evaluation gave
> higher scores. Does this make sense?
> I actually trained two Moses engines. For the first evaluation (1000
> lines), Moses Engine 1's score was lower than Moses Engine 2's; but for
> the second evaluation (12000 lines), Moses Engine 1's score was higher
> than Moses Engine 2's.
> Which result should I trust? This phenomenon makes me trust the scores
> less.
>
> Does anybody have similar experiences? Is there a problem with my
> evaluation data?
> How can I generate more accurate scores?
>
> Thanks so much,
> Wenlong
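For the data-selection question, one common approach is simply to hold out a random sample of in-domain sentence pairs before training, so the test set comes from the same distribution as the intended use and never overlaps the training data. Below is a minimal sketch of such a split; the file names (corpus.src, corpus.ref, etc.) and the sample size are only placeholders, not anything prescribed by Moses:

    import random

    SAMPLE_SIZE = 2000   # placeholder: size of the held-out test set
    random.seed(42)      # fixed seed so the split is reproducible

    # Read the sentence-aligned source and reference files.
    with open("corpus.src", encoding="utf-8") as f_src, \
         open("corpus.ref", encoding="utf-8") as f_ref:
        pairs = list(zip(f_src.readlines(), f_ref.readlines()))

    assert len(pairs) >= SAMPLE_SIZE, "corpus smaller than requested test set"

    # Pick a random subset for evaluation; everything else stays in training.
    test_idx = set(random.sample(range(len(pairs)), SAMPLE_SIZE))

    with open("test.src", "w", encoding="utf-8") as t_src, \
         open("test.ref", "w", encoding="utf-8") as t_ref, \
         open("train.src", "w", encoding="utf-8") as r_src, \
         open("train.ref", "w", encoding="utf-8") as r_ref:
        for i, (src, ref) in enumerate(pairs):
            if i in test_idx:
                t_src.write(src)
                t_ref.write(ref)
            else:
                r_src.write(src)
                r_ref.write(ref)

The key point is that the held-out lines should be drawn from the same domain as the intended use and kept strictly out of training, tuning, and language-model data; otherwise the BLEU/NIST scores will be optimistic.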

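On which result to trust: BLEU measured on a small test set has high variance, so a 1000-line set can easily rank two engines differently from a 12000-line set, and the larger set is usually the more reliable one. One way to check whether a BLEU difference between two engines is meaningful is paired bootstrap resampling over the test sentences (Koehn, 2004). The sketch below is only an illustration, not the standard Moses evaluation script; it assumes NLTK is installed, and the file names (test.ref, engine1.out, engine2.out) and the number of resamples are placeholders:

    import random
    from nltk.translate.bleu_score import corpus_bleu

    def read_tokenized(path):
        # One sentence per line, whitespace-tokenized.
        with open(path, encoding="utf-8") as f:
            return [line.split() for line in f]

    refs = read_tokenized("test.ref")      # reference translations
    hyp1 = read_tokenized("engine1.out")   # output of engine 1 (placeholder name)
    hyp2 = read_tokenized("engine2.out")   # output of engine 2 (placeholder name)
    assert len(refs) == len(hyp1) == len(hyp2)

    random.seed(0)
    N_RESAMPLES = 1000
    wins1 = 0  # how often engine 1 scores higher on a resampled test set

    for _ in range(N_RESAMPLES):
        # Build a test set of the same size by sampling sentences with replacement.
        idx = [random.randrange(len(refs)) for _ in range(len(refs))]
        sample_refs = [[refs[i]] for i in idx]  # corpus_bleu expects a list of reference lists
        bleu1 = corpus_bleu(sample_refs, [hyp1[i] for i in idx])
        bleu2 = corpus_bleu(sample_refs, [hyp2[i] for i in idx])
        if bleu1 > bleu2:
            wins1 += 1

    print(f"Engine 1 wins on {wins1}/{N_RESAMPLES} resampled test sets")

If one engine wins on the large majority of resamples (say 95% or more), the difference is probably real; if the wins are roughly split, the two rankings you observed are both within the noise of the test-set choice.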