Dear Somayeh,

Note also that absolute scores depend heavily on tokenization (I've seen 
differences of up to 10 points absolute). mteval-v11b does its own 
tokenization (possibly splitting already tokenized input even further), 
whereas multi-bleu from Moses trusts your tokenization.
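
To illustrate the tokenization effect, here is a minimal sketch in Python. 
The two tokenizers and the helper names (whitespace_tokens, 
split_punct_tokens, clipped_precision) are hypothetical and much simpler than 
what mteval-v11b or multi-bleu actually do; the point is only that the same 
sentence pair gives a different clipped unigram precision depending on who 
tokenizes.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Trust the input: split on whitespace only."""
    return text.split()


def split_punct_tokens(text):
    """Toy re-tokenizer (an assumption, not mteval's rules):
    separate every punctuation mark from the neighbouring word."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision against a single reference."""
    hyp_counts = ngrams(hyp_tokens, n)
    ref_counts = ngrams(ref_tokens, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


hyp = "Hello, world."
ref = "Hello, world!"

for tokenize in (whitespace_tokens, split_punct_tokens):
    p1 = clipped_precision(tokenize(hyp), tokenize(ref), 1)
    print(tokenize.__name__, p1)
```

With whitespace splitting only "Hello," matches (1 of 2 unigrams); after 
separating punctuation, "Hello", "," and "world" all match (3 of 4).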

Another difference can come from the definition of "reference length" with 
multiple references. Some implementations use the shortest reference length; 
the original paper by Papineni et al. says 'closest' but does not specify 
*which* one in case of a tie! (If the hypothesis is 10 words and two 
references are 8 and 12 words, which of the two has the closest length?) 
Implementations differ on this, and some even depend on the *order* in which 
the multiple references are loaded!
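
The 8-vs-12 tie can be sketched as follows. The three selection functions are 
hypothetical illustrations of policies an implementation might follow, not the 
code of any particular tool; the brevity penalty is the usual one from the 
BLEU paper (1 if the hypothesis is longer than the reference, exp(1 - r/c) 
otherwise).

```python
import math


def shortest_ref_len(ref_lens):
    """Policy A: always use the shortest reference."""
    return min(ref_lens)


def closest_prefer_shorter(hyp_len, ref_lens):
    """Policy B: closest length, ties broken towards the shorter reference."""
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))


def closest_first_wins(hyp_len, ref_lens):
    """Policy C: closest length, ties broken by load order
    (Python's min() returns the first minimal item it encounters)."""
    return min(ref_lens, key=lambda r: abs(r - hyp_len))


def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty for hypothesis length c and reference length r."""
    if hyp_len == 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


hyp_len = 10
for ref_lens in ([8, 12], [12, 8]):
    r = closest_first_wins(hyp_len, ref_lens)
    print(ref_lens, "->", r, "BP =", round(brevity_penalty(hyp_len, r), 4))
```

Under policy C the reference order [8, 12] picks 8 (no penalty), while 
[12, 8] picks 12 and multiplies the score by exp(-0.2), roughly 0.82. 
Policies A and B pick 8 in both cases.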

The main message: never trust the numbers. Compare only BLEU scores you 
calculated yourself using a fixed tokenization tool and a fixed BLEU 
implementation.

Cheers, O.

"Somayeh Bakhshaei" <[email protected]> wrote:

>Hello,
>
>I have some questions about mteval-v11b.pl:
>
>1) Multiple references cannot be used with mteval. What is an equivalent 
>tool for this purpose?
>2) I tried multi-bleu.perl, but the scores decreased, while we expect them 
>to increase when adding more reference sets. How is this possible?
>3) I tested mteval-v11b.pl and multi-bleu.perl in equivalent situations, and 
>they do not always agree: sometimes mteval and sometimes the other gives 
>better scores. Is there a problem?
>4) Finally, is there any better tool that supports multiple references?
>
>------------------
>
>Best Regards,
>
>S.Bakhshaei
>
>
>      _______________________________________________
>Moses-support mailing list
>[email protected]
>http://mailman.mit.edu/mailman/listinfo/moses-support


-- 
Ondrej Bojar
http://www.cuni.cz/~obo