> I would just like to know if there is a significant difference when
> scoring translations using multi-bleu.
>
> With multi-bleu I got the following scores for testing on 2000
> sentences:
>
> BLEU = 34.62, 63.4/38.8/27.8/21.3 (BP=0.996, ratio=0.996,
> hyp_len=16587, ref_len=16660)
>
> and the following for 5082 sentences:
>
> BLEU = 3.82, 11.1/4.0/2.6/1.9 (1, 1.017, 44536, 43809)
>
> The only change I made was increasing the corpus size from 6053 to
> 8948.
First, a caveat: in general, BLEU scores are only comparable when they are computed using the same reference set. It's possible to get fairly divergent BLEU scores from an identical system on two different data sets from the same domain. That said, I've never seen differences anywhere near that large, so you should double-check your experimental setup. For instance, the second set of numbers is the n-gram precisions (for increasing orders of n). In your example, the unigram precision went from 63.4 to 11.1, a sure sign of problems.

To answer some of your other questions:

> Another question: what do the other parameters, apart from the first
> one (the BLEU score), mean?

They are the n-gram precisions, brevity penalty, length ratio, hypothesis length, and reference length. For an explanation, see the paper:

http://aclweb.org/anthology-new/P/P02/P02-1040.pdf

> Also, is multi-bleu on par with mteval?

bleu-1.04.pl (IBM BLEU), mteval-11x.pl (NIST BLEU), and multi-bleu.pl (Moses BLEU) all report slightly different scores.

> Can I consider a BLEU of 34.62 to be correct?

There is no such thing as "correct" with BLEU. Just make sure you use the same evaluation script for every output in your experiment.

Cheers

Adam
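A note on the components listed above: the final score is simply the brevity penalty times the geometric mean of the four n-gram precisions, so you can sanity-check any multi-bleu.pl output line by recombining the reported parts. Below is a minimal Python sketch of that arithmetic; the function name is made up for illustration, and the formula is the standard one from the Papineni et al. paper linked above. The recomputed values differ slightly from the printed scores only because the precisions are rounded to one decimal place.

import math

def bleu_from_components(precisions, hyp_len, ref_len):
    # precisions: 1- to 4-gram precisions as percentages,
    # e.g. [63.4, 38.8, 27.8, 21.3].
    # Brevity penalty: 1 if the hypothesis is longer than the
    # reference, exp(1 - ref_len/hyp_len) otherwise.
    bp = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    # Geometric mean of the precisions, computed in log space.
    log_avg = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
    return 100.0 * bp * math.exp(log_avg)

# 2000-sentence test set: prints ~34.6, matching the reported BLEU = 34.62
print(bleu_from_components([63.4, 38.8, 27.8, 21.3], 16587, 16660))

# 5082-sentence test set: prints ~3.85, matching the reported BLEU = 3.82
print(bleu_from_components([11.1, 4.0, 2.6, 1.9], 44536, 43809))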
