IIRC, the principal difference is the calculation of the brevity penalty, but there also seem to be some slight differences in tokenization between the two scripts.
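If I remember right, mteval takes the shortest reference length when computing the brevity penalty, while multi-bleu.perl takes the reference length closest to the hypothesis. A minimal Python sketch of the two choices (function names are mine, not from either script):

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: exp(1 - r/c) when the hypothesis is shorter."""
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

def closest_ref_len(hyp_len, ref_lens):
    # multi-bleu.perl-style choice: the reference length closest to the hypothesis
    # (ties broken toward the shorter reference)
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))

def shortest_ref_len(ref_lens):
    # mteval-style choice (IIRC): the shortest reference length
    return min(ref_lens)

# Sanity check against the lengths in the report below:
print(round(brevity_penalty(281, 309), 3))  # 0.905, matching BP=0.905

# With multiple references the two length choices can diverge:
hyp_len, ref_lens = 281, [250, 285, 400]
print(brevity_penalty(hyp_len, closest_ref_len(hyp_len, ref_lens)))  # ~0.986 (r=285)
print(brevity_penalty(hyp_len, shortest_ref_len(ref_lens)))          # 1.0    (r=250)
```

With a single reference per sentence both choices coincide, so in that case any remaining score difference would have to come from tokenization.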
On Fri, Mar 19, 2010 at 9:32 AM, Mark Fishel <[email protected]> wrote:
> Dear list,
>
> I am getting different BLEU scores from the NIST mteval script
> (version) and the multi-bleu.perl script within Moses's distribution
> for the same reference and hypothesis translations -- even the
> individual n-gram precisions are different:
>
> BLEU = 16.80, 53.0/26.2/13.4/6.4 (BP=0.905, ratio=0.909, hyp_len=281,
> ref_len=309)
>
> and
>
> BLEU score = 0.1681 for system "x"
>
> Individual N-gram scoring
>        1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram
>        ------  ------  ------  ------  ------  ------  ------  ------  ------
> BLEU:  0.5246  0.2591  0.1326  0.0630  0.0328  0.0213  0.0133  0.0046  0.0000  "x"
>
> The files that produced the scores are here: mtj.ut.ee/diffbleu.tgz .
>
> Does everyone else get different scores? Can anyone suggest a reason
> for that? It's not the smoothing of the NIST script, both support UTF8
> i/o, etc; so I'm out of ideas, and before comparing the
> implementations I wanted to ask for opinions.
>
> Thanks in advance,
> Mark
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
