[Moses-support] different bleu scores from nist and moses scripts

Mark Fishel Fri, 19 Mar 2010 02:33:08 -0700

Dear list,

I am getting different BLEU scores from the NIST mteval script
(version) and the multi-bleu.perl script in the Moses distribution
for the same reference and hypothesis translations -- even the
individual n-gram precisions differ. multi-bleu.perl reports:


BLEU = 16.80, 53.0/26.2/13.4/6.4 (BP=0.905, ratio=0.909, hyp_len=281, ref_len=309)

and the NIST script reports:

BLEU score = 0.1681 for system "x"

Individual N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
        ------   ------   ------   ------   ------   ------   ------   ------   ------
 BLEU:  0.5246   0.2591   0.1326   0.0630   0.0328   0.0213   0.0133   0.0046   0.0000  "x"
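
As a sanity check on the numbers above, here is a minimal Python sketch that
recombines the reported n-gram precisions using the standard BLEU formula
(brevity penalty times the geometric mean of the precisions). It assumes
uniform weights over 1- to 4-grams and that the NIST "BLEU score" is the
4-gram cumulative score; the function name bleu_from_parts is made up for
illustration and is not part of either script.

```python
import math

def bleu_from_parts(precisions, hyp_len, ref_len):
    """Return (bleu, brevity_penalty) from n-gram precisions and lengths."""
    # Brevity penalty: 1 if the hypothesis is longer than the reference,
    # otherwise exp(1 - ref_len / hyp_len).
    bp = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the precisions (uniform weights, an assumption).
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean), bp

# Figures reported by multi-bleu.perl (precisions are rounded in its output).
moses_bleu, moses_bp = bleu_from_parts([0.530, 0.262, 0.134, 0.064], 281, 309)

# The NIST output shows no lengths, so back out the brevity penalty implied
# by its reported score and its first four precisions.
nist_precisions = [0.5246, 0.2591, 0.1326, 0.0630]
nist_geo_mean = math.exp(sum(math.log(p) for p in nist_precisions) / 4)
nist_implied_bp = 0.1681 / nist_geo_mean

print(f"multi-bleu: BLEU={moses_bleu:.4f} BP={moses_bp:.3f}")
print(f"mteval:     implied BP={nist_implied_bp:.3f}")
```

With these inputs the multi-bleu figures recombine to about 0.168 with a
brevity penalty of about 0.905, matching the output above up to rounding,
while the brevity penalty implied by the NIST figures comes out at about
0.916. That would suggest the two scripts are not counting the same tokens,
though this is only an inference from the printed numbers.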

The files that produced the scores are here: mtj.ut.ee/diffbleu.tgz .

Does anyone else get different scores? Can anyone suggest a reason
for that? It's not the NIST script's smoothing, and both scripts
support UTF-8 I/O, so I'm out of ideas; before comparing the
implementations I wanted to ask for opinions.

Thanks in advance,
Mark
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
