Hi,

scores will differ for different test sets. If you randomly sample the test set
from your parallel data, though, they should be relatively similar.

You can compute confidence intervals with bootstrap resampling, which
will give you some indication of how reliable a, say, 0.5, 1.0, or 2.0 point
difference in BLEU is.
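As a minimal sketch of the idea (the function name and scores below are hypothetical; tools shipped with Moses and sacreBLEU do this properly by recomputing corpus-level BLEU on each resample rather than averaging per-sentence scores):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score.

    `scores` is a list of per-sentence quality scores. This is a
    simplification: real BLEU is a corpus-level statistic, so in
    practice you resample sentence indices and recompute BLEU from
    the resampled n-gram counts on each iteration.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Draw a test set of the same size, with replacement
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-sentence scores for one system on a small test set
sys1 = [0.31, 0.28, 0.35, 0.40, 0.22, 0.30, 0.33, 0.27, 0.38, 0.29]
lo, hi = bootstrap_ci(sys1)
print("95%% CI: [%.3f, %.3f]" % (lo, hi))
```

If the confidence intervals of two systems overlap heavily, a small BLEU difference between them is probably not meaningful.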

Regarding test set size: it should be at least 1000 sentence pairs;
12,000 is certainly very large.

-phi

On Sat, Aug 7, 2010 at 11:40 AM, Wenlong Yang <[email protected]> wrote:
> Hi all,
>
> can any of you help to provide some materials about how to select the sample
> data for BLEU/NIST evaluation?
> I mean, how many lines of data should I choose for the evaluation? And how
> can I choose the data so that it is more representative of our domain/use?
>
>
> I have tried to generate BLEU scores using a 1000-line sample and a
> 12,000-line sample, both in our domain, but the second evaluation
> produced higher scores; does this make sense?
> I actually trained two Moses engines. For the first evaluation (1000 lines),
> Moses Engine 1's score is lower than Moses Engine 2's; but for the second
> (12,000 lines), Moses Engine 1's score is higher than Moses Engine 2's.
> Which result should I trust? This phenomenon makes me trust the scores less.
>
> Does anybody have any similar experiences? Is there any problem with my
> evaluation data?
> How can I generate more accurate scores?
>
> Thanks so much,
> Wenlong
>
>
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
>
