This is nice! Could your tool also support an option that makes use of the 
multi-reference test sets that are available for English-Finnish in 2016 and 
2017? They would finally be used for something if there were a simple option 
to download and use those sets for standard evaluation. 
Thanks!

Jörg

**********************************************************************************************
Jörg Tiedemann
Department of Modern Languages        http://blogs.helsinki.fi/tiedeman/
University of Helsinki                http://blogs.helsinki.fi/language-technology/



On 11 Nov 2017, at 12:37, Matt Post <[email protected]> wrote:

Hi,

I’ve written a BLEU scoring tool called “sacreBLEU” that may be of use to 
people here. The goal is to get people to start reporting WMT-matrix compatible 
scores in their papers (i.e., scoring on detokenized outputs with a fixed 
reference tokenization) so that numbers can be compared directly, in the spirit 
of Rico Sennrich's multi-bleu-detok.pl. The nice part for you is that it 
auto-downloads WMT datasets and makes it so you no longer have to deal with 
SGML. You can install it via pip:

    pip3 install sacrebleu

For starters, you can use it to easily download datasets:

    sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
    sacrebleu -t wmt17 -l en-de --echo ref > wmt17.en-de.de

You don’t need to download the reference, though. You can just score against it 
using sacreBLEU directly. After decoding and detokenizing, try:

    cat output.detok.txt | sacrebleu -t wmt17 -l en-de

I have tested it, and it produces exactly the same numbers as Moses' 
mteval-v13a.pl, which is the official scoring script for WMT. The scores match 
for all 153 WMT17 system submissions (column BLEU-cased at 
matrix.statmt.org). For example:

    $ cat newstest2017.uedin-nmt.4722.en-de | sacrebleu -t wmt17 -l en-de
    BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.1.4 = 28.30 59.9/34.0/21.8/14.4 (BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)

This means numbers computed with it are directly comparable across papers. As 
you can see, in addition to the score, it outputs a version string that records 
the exact BLEU parameters used. The output string is compatible with the output 
of multi-bleu.pl, so your old code for parsing the BLEU score out of 
multi-bleu.pl should still work.
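
If you do not already have a multi-bleu.pl parser, the output line shown above 
could be picked apart roughly as follows. This is only a sketch in Python: it 
assumes the output looks exactly like the example line, and the names 
parse_bleu_line and BLEU_LINE are illustrative, not part of sacreBLEU.

```python
import re

# Hypothetical pattern modeled on the single example line above;
# it is not taken from sacreBLEU and may need adjusting for other versions.
NUM = r"\d+(?:\.\d+)?"
BLEU_LINE = re.compile(
    r"^(?P<signature>\S+)\s*=\s*(?P<score>" + NUM + r")\s+"
    r"(?P<precisions>" + NUM + r"(?:/" + NUM + r"){3})\s+"
    r"\(BP\s*=\s*(?P<bp>" + NUM + r")\s+"
    r"ratio\s*=\s*(?P<ratio>" + NUM + r")\s+"
    r"hyp_len\s*=\s*(?P<hyp_len>\d+)\s+"
    r"ref_len\s*=\s*(?P<ref_len>\d+)\)\s*$"
)


def parse_bleu_line(line):
    """Split one output line into the score, n-gram precisions and parameters."""
    match = BLEU_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized BLEU line: %r" % line)
    # The version string is "BLEU" followed by "+key.value" fields;
    # split each field on the first dot only, so "version.1.1.4" stays intact.
    fields = match.group("signature").split("+")[1:]
    params = dict(field.split(".", 1) for field in fields)
    return {
        "score": float(match.group("score")),
        "precisions": [float(p) for p in match.group("precisions").split("/")],
        "bp": float(match.group("bp")),
        "ratio": float(match.group("ratio")),
        "hyp_len": int(match.group("hyp_len")),
        "ref_len": int(match.group("ref_len")),
        "params": params,
    }


example = (
    "BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a"
    "+version.1.1.4 = 28.30 59.9/34.0/21.8/14.4 "
    "(BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)"
)
result = parse_bleu_line(example)
print(result["score"], result["params"]["tok"], result["params"]["version"])
```

Keeping the parsed parameters next to the score makes it easy to check that two 
numbers were computed with the same settings before comparing them.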

You can also use the tool in a backward-compatible mode with arbitrary 
references, in the same way:

    cat output.detok.txt | sacrebleu ref1 [ref2 …]
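
To illustrate what extra references change, here is a small sketch of the 
standard BLEU clipping rule, where each hypothesis n-gram is credited up to the 
largest count seen in any single reference. It is a simplified illustration in 
Python, not sacreBLEU's code: the function names are made up, tokenization is 
plain whitespace splitting, and the sentences are toy examples.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hyp, refs, n):
    """Return (matched, total) n-gram counts for one hypothesis.

    Each hypothesis n-gram is clipped to the maximum number of times it
    occurs in any one reference, as in the standard BLEU definition.
    """
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Toy data (assumption: whitespace tokenization, not the 13a tokenizer).
hyp = "the cat sat on the mat".split()
ref1 = "the cat is on the mat".split()
ref2 = "a cat sat on a mat".split()

one_ref_bigrams = clipped_matches(hyp, [ref1], 2)
two_ref_bigrams = clipped_matches(hyp, [ref1, ref2], 2)
print(one_ref_bigrams, two_ref_bigrams)
```

With the second reference added, the bigrams "cat sat" and "sat on" also count 
as matches, so the same output gets a higher bigram precision.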

The official code is in sockeye (Amazon’s NMT system):

    
    github.com/awslabs/sockeye/tree/master/contrib/sacrebleu

I will also likely maintain a clone here:

    github.com/mjpost/sacreBLEU

matt
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
