There have been several discussions about the non-deterministic
nature of Moses and its processes, especially mert-moses.pl. This one is
about mteval-12.pl (and possibly other BLEU scoring tools).

So, I created a
translation model with ~1.5 million phrase pairs and extracted 5,000
random pairs. I then shuffled the 5,000 pairs five times to create five
different splits, each with 2,500 pairs for tuning and 2,500 for
evaluation, each a unique blend of the original 5,000.

The five mert-moses.pl sessions
ran between 8 and 15 iterations each. As expected, the five resulting
BLEU scores in the final moses.ini files were different. They ranged from
0.8551 to 0.8590. Then I ran mteval-12.pl with the second half of each
set. The resulting BLEU scores were typically slightly lower than the
BLEU score reported in the moses.ini file. So far, so good.

Here's what I
didn't expect. I shuffled the order of the pairs in each evaluation set
and ran mteval-12.pl again. For each set, the same data shuffled into a
different order and run through mteval-12.pl produced a different
cumulative BLEU score. These scores varied from 0.8520 to 0.8627. Same
data, different evaluation order.

Can someone confirm
that the mteval-12.pl reporting tool is also non-deterministic when
evaluating the same data, even when unrelated segments are evaluated
in a different order? Do all BLEU scoring tools (mteval-12, mteval-13,
etc.) produce the same results?
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support