Looks like I am using the right method, thank you Philipp.
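
For anyone who finds this thread later, below is a rough, untested Python sketch of the bootstrap resampling Philipp describes. The corpus_bleu argument is only a placeholder for whatever corpus-level scorer you already use (it is not one of the scripts that ship with Moses), and the sample count and percentile cut-offs are my own choices:

import random


def bootstrap_ci(hyps, refs, corpus_bleu, n_samples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a corpus-level score.

    hyps, refs  -- parallel lists of hypothesis / reference sentences
    corpus_bleu -- any callable returning a corpus BLEU for (hyps, refs);
                   plug in whatever scorer you already use (placeholder)
    """
    assert len(hyps) == len(refs)
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_samples):
        # draw a test set of the same size, sampling sentences with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(corpus_bleu([hyps[i] for i in idx],
                                  [refs[i] for i in idx]))
    scores.sort()
    lo = scores[int((alpha / 2) * n_samples)]
    hi = scores[int((1 - alpha / 2) * n_samples) - 1]
    return lo, hi


def paired_bootstrap(hyps_a, hyps_b, refs, corpus_bleu, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which engine A outscores engine B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # both engines are scored on the *same* resampled sentences
        idx = [rng.randrange(n) for _ in range(n)]
        a = corpus_bleu([hyps_a[i] for i in idx], [refs[i] for i in idx])
        b = corpus_bleu([hyps_b[i] for i in idx], [refs[i] for i in idx])
        if a > b:
            wins_a += 1
    return wins_a / float(n_samples)

The paired version speaks directly to the "which engine should I trust" question: if engine A wins on, say, more than 95% of the resampled test sets, the difference is unlikely to be noise.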

2010/8/9, Philipp Koehn <[email protected]>:
> Hi,
>
> scores will differ for different test sets. If you randomly sample the
> test set from your parallel data, they should be relatively similar,
> though.
>
> You can compute confidence intervals with bootstrap resampling, which
> will give you some indication of how reliable a, say, 0.5, 1.0, or 2.0
> point difference in BLEU is.
>
> Regarding test set size, it should be at least 1,000 sentence pairs;
> 12,000 is certainly very large.
>
> -phi
>
> On Sat, Aug 7, 2010 at 11:40 AM, Wenlong Yang <[email protected]> wrote:
>> Hi all,
>>
>> Can any of you point me to some materials on how to select the sample
>> data for a BLEU/NIST evaluation?
>> I mean, how many lines of data should I choose for the evaluation, and
>> how can I choose the data so that it is more representative of our
>> domain/use?
>>
>>
>> I have tried to generate BLEU scores using a 1,000-line sample and a
>> 12,000-line sample, both from our domain, but the second evaluation
>> gives higher scores. Does this make sense?
>> I actually trained two Moses engines: in the first evaluation (1,000
>> lines), Moses Engine 1's score is lower than Moses Engine 2's; but in
>> the second evaluation (12,000 lines), Moses Engine 1's score is higher
>> than Moses Engine 2's.
>> Which result should I trust? This phenomenon makes me trust the scores
>> less.
>>
>> Does anybody have similar experiences? Is there any problem with my
>> evaluation data?
>> How can I generate more accurate scores?
>>
>> Thanks so much,
>> Wenlong
>>
>
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
