>
> I would just like to know if there is a significant difference
> when scoring translations using multi-bleu.
>
> With multi-bleu I got the following scores for testing on 2000 sentences:
>
> BLEU = 34.62, 63.4/38.8/27.8/21.3 (BP=0.996, ratio=0.996, hyp_len=16587, ref_len=16660)
>
> and the following for 5082 sentences
>
> BLEU = 3.82, 11.1/4.0/2.6/1.9 (BP=1, ratio=1.017, hyp_len=44536, ref_len=43809)
>
> The only change I made was to increase the corpus size from 6053 to
> 8948 sentences.

First, a caveat: In general, BLEU scores are only comparable when they  
are computed using the same reference set.  It's possible to get  
fairly divergent BLEU scores using an identical system on two  
different data sets from the same domain.

That said, I've never seen differences anywhere near that large, so
you should double-check your experimental setup.  For instance, the
numbers after the BLEU score are the n-gram precisions (for
increasing orders of n).  In your example, the unigram precision
dropped from 63.4 to 11.1, a sure sign that something is wrong.

To answer some of your other questions:

> Another question: what do the other parameters, apart from the first
> one (the BLEU score), mean?

They are the n-gram precisions, the brevity penalty (BP), the length
ratio, the hypothesis length, and the reference length.  For an
explanation, see the paper:
http://aclweb.org/anthology-new/P/P02/P02-1040.pdf
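
To see how those fields fit together, here is a rough sketch in
Python of the standard BLEU formula (brevity penalty times the
geometric mean of the four n-gram precisions), plugged in with the
numbers from your first run.  The small gap to the reported 34.62 is
just rounding in the printed precisions.

import math

# n-gram precisions as printed by multi-bleu (unigram to 4-gram), in percent
precisions = [63.4, 38.8, 27.8, 21.3]
bp = 0.996  # brevity penalty

# BLEU = BP * exp(average of the log precisions)
log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
bleu = 100 * bp * math.exp(log_mean)

print("BLEU = %.2f" % bleu)  # ~34.6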

> Also, is multi-bleu on par with mteval?

bleu-1.04.pl (IBM BLEU), mteval-11x.pl (NIST BLEU), and multi-bleu.pl  
(Moses BLEU) all report slightly different scores.

> Can I consider a BLEU of 34.62 to be correct?

There is no such thing as "correct" with BLEU.  Just make sure you use  
the same evaluation script for every output in your experiment.
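
If it helps, here is a minimal sketch of doing exactly that: scoring
several outputs against the same reference with the same script.  It
assumes the Moses multi-bleu.perl script is available and uses
hypothetical file names (ref.txt, baseline.txt, new-system.txt):

import subprocess

# Score each hypothesis file against the same reference with the same script
for hyp in ["baseline.txt", "new-system.txt"]:
    with open(hyp) as f:
        result = subprocess.run(["perl", "multi-bleu.perl", "ref.txt"],
                                stdin=f, capture_output=True, text=True)
    print(hyp, result.stdout.strip())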


Cheers
Adam


