Dear list,

I am getting different BLEU scores from the NIST mteval script
(version) and the multi-bleu.perl script in the Moses distribution
for the same reference and hypothesis translations. Even the
individual n-gram precisions differ:

BLEU = 16.80, 53.0/26.2/13.4/6.4 (BP=0.905, ratio=0.909, hyp_len=281, ref_len=309)

and

BLEU score = 0.1681 for system "x"

Individual N-gram scoring
        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
        ------   ------   ------   ------   ------   ------   ------   ------   ------
 BLEU:  0.5246   0.2591   0.1326   0.0630   0.0328   0.0213   0.0133   0.0046   0.0000  "x"

The files that produced the scores are here: mtj.ut.ee/diffbleu.tgz .

Does anyone else get different scores from these files? Can anyone
suggest a reason? It is not the smoothing in the NIST script, and both
scripts support UTF-8 I/O, so I'm out of ideas. Before comparing the
implementations, I wanted to ask for opinions.
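
For reference, here is a rough sanity check on the numbers above. It is
only a sketch: it assumes both scripts use the standard BLEU-4 definition
(brevity penalty times the geometric mean of the 1- to 4-gram
precisions), and the function names are mine, not taken from either
script. It recomputes the multi-bleu.perl score from its printed
precisions and lengths, and backs out the brevity penalty that the mteval
precisions would imply.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is not shorter."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def geometric_mean(precisions):
    """Geometric mean of strictly positive n-gram precisions."""
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))


# 1- to 4-gram precisions as printed by each script (rounded values).
multi_bleu_precisions = [0.530, 0.262, 0.134, 0.064]
mteval_precisions = [0.5246, 0.2591, 0.1326, 0.0630]

# multi-bleu.perl reports hyp_len=281 and ref_len=309.
bp = brevity_penalty(281, 309)
multi_bleu_score = bp * geometric_mean(multi_bleu_precisions)

# mteval does not print its brevity penalty here, so back it out
# from the reported score of 0.1681 (assumes plain BLEU-4).
implied_mteval_bp = 0.1681 / geometric_mean(mteval_precisions)

print(f"multi-bleu BP:     {bp:.4f}")
print(f"multi-bleu BLEU:   {multi_bleu_score:.4f}")
print(f"implied mteval BP: {implied_mteval_bp:.4f}")
```

Under these assumptions the multi-bleu.perl figures are internally
consistent (BP of about 0.905, BLEU of about 0.168), while the mteval
precisions imply a brevity penalty of roughly 0.916. If that holds, the
two scripts are counting different numbers of tokens, which could point
to a tokenization difference rather than a difference in the BLEU
formula, but that is a guess until the implementations are compared.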

Thanks in advance,
Mark
_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
