Dear list,
I am getting different BLEU scores from the NIST mteval script
(version) and the multi-bleu.perl script in the Moses distribution
for the same reference and hypothesis translations -- even the
individual n-gram precisions are different:
BLEU = 16.80, 53.0/26.2/13.4/6.4 (BP=0.905, ratio=0.909, hyp_len=281, ref_len=309)
and
BLEU score = 0.1681 for system "x"
Individual N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
        ------   ------   ------   ------   ------   ------   ------   ------   ------
 BLEU:  0.5246   0.2591   0.1326   0.0630   0.0328   0.0213   0.0133   0.0046   0.0000  "x"
The files that produced the scores are here: mtj.ut.ee/diffbleu.tgz
Does anyone else get different scores on these files? Can anyone
suggest a reason for the difference? It is not the smoothing in the
NIST script, and it is not an encoding issue, since both scripts
support UTF-8 I/O. I am out of ideas, so before comparing the two
implementations I wanted to ask for opinions.
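
One quick check that needs only the numbers quoted above is to
recombine the reported precisions by hand. The sketch below assumes
the standard BLEU definition (brevity penalty times the geometric mean
of the 1- to 4-gram precisions) and assumes that the "Individual
N-gram scoring" row of the NIST output holds plain precisions without
the brevity penalty folded in. The function names are hypothetical
helpers, not part of either script.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is longer
    than the reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def combine(precisions, bp):
    """BLEU as BP times the geometric mean of the n-gram precisions."""
    if min(precisions) <= 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_avg)


def implied_bp(bleu, precisions):
    """Brevity penalty a script must have applied, given its reported
    BLEU score and its reported n-gram precisions."""
    return bleu / combine(precisions, 1.0)


# Numbers copied from the two outputs above (1- to 4-gram only).
multi_bleu_prec = [0.530, 0.262, 0.134, 0.064]
mteval_prec = [0.5246, 0.2591, 0.1326, 0.0630]

bp = brevity_penalty(hyp_len=281, ref_len=309)
print("multi-bleu BP from lengths:  %.4f" % bp)
print("multi-bleu BLEU recombined:  %.4f" % combine(multi_bleu_prec, bp))
print("mteval implied BP:           %.4f" % implied_bp(0.1681, mteval_prec))
```

If the brevity penalty implied by the NIST numbers differs from the
0.905 that multi-bleu.perl reports, the two scripts are presumably
counting hypothesis or reference tokens differently (for example
through different tokenization). That is a hypothesis to verify
against the files, not a confirmed cause.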
Thanks in advance,
Mark