IIRC, the principal difference is the calculation of the brevity
penalty, but there also seem to be some slight differences in
tokenization between the scripts.
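To make the brevity-penalty point concrete, here is a minimal sketch of the two conventions the scripts are commonly said to use: picking the reference length *closest* to the hypothesis length (multi-bleu.perl style) versus picking the *shortest* reference length (older NIST mteval versions). This is an illustration under those assumptions, not a line-for-line port of either script; the tie-breaking rule in `effective_ref_len_closest` is my own choice.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    # Standard BLEU brevity penalty: 1 if the hypothesis is at least as
    # long as the effective reference, else exp(1 - r/c).
    return 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)

def effective_ref_len_closest(hyp_len, ref_lens):
    # multi-bleu.perl-style convention (assumed): reference length closest
    # to the hypothesis length; on ties, prefer the shorter reference.
    return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))

def effective_ref_len_shortest(ref_lens):
    # Convention attributed to older NIST mteval versions (assumed):
    # always use the shortest reference length.
    return min(ref_lens)

# With two references of lengths 11 and 8 and a 10-token hypothesis,
# the two conventions disagree: "closest" picks 11 and penalizes the
# hypothesis, while "shortest" picks 8 and applies no penalty.
hyp_len = 10
ref_lens = [11, 8]
bp_closest = brevity_penalty(hyp_len, effective_ref_len_closest(hyp_len, ref_lens))
bp_shortest = brevity_penalty(hyp_len, effective_ref_len_shortest(ref_lens))
```

Since BLEU is the geometric mean of the n-gram precisions multiplied by the brevity penalty, a different effective reference length shifts the final score even when the precisions agree.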

On Fri, Mar 19, 2010 at 9:32 AM, Mark Fishel <[email protected]> wrote:
> Dear list,
>
> I am getting different BLEU scores from the NIST mteval script
> (version) and the multi-bleu.perl script within Moses's distribution
> for the same reference and hypothesis translations -- even the
> individual n-gram precisions are different:
>
> BLEU = 16.80, 53.0/26.2/13.4/6.4 (BP=0.905, ratio=0.909, hyp_len=281,
> ref_len=309)
>
> and
>
> BLEU score = 0.1681 for system "x"
>
> Individual N-gram scoring
>        1-gram   2-gram   3-gram   4-gram   5-gram   6-gram   7-gram   8-gram   9-gram
>        ------   ------   ------   ------   ------   ------   ------   ------   ------
>  BLEU:  0.5246   0.2591   0.1326   0.0630   0.0328   0.0213   0.0133   0.0046   0.0000  "x"
>
> The files that produced the scores are here: mtj.ut.ee/diffbleu.tgz .
>
> Does anyone else get different scores? Can anyone suggest a reason
> for this? It's not the smoothing in the NIST script, and both scripts
> support UTF-8 I/O, etc.; so I'm out of ideas, and before comparing the
> implementations I wanted to ask for opinions.
>
> Thanks in advance,
> Mark
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>
