This is nice! Could your tool even support an option that makes use of the multi-reference test sets that are available for English-Finnish in 2016 and 2017? They would finally be used for something if there were a simple option to download and use those sets for standard evaluation. Thanks!
Jörg

**********************************************************************************************
Jörg Tiedemann
Department of Modern Languages    http://blogs.helsinki.fi/tiedeman/
University of Helsinki            http://blogs.helsinki.fi/language-technology/

On 11 Nov 2017, at 12:37, Matt Post <[email protected]> wrote:

Hi,

I’ve written a BLEU scoring tool called “sacreBLEU” that may be of use to people here. The goal is to get people to start reporting WMT-matrix-compatible scores in their papers (i.e., scoring on detokenized outputs with a fixed reference tokenization) so that numbers can be compared directly, in the spirit of Rico Sennrich's multi-bleu-detok.pl. The nice part for you is that it auto-downloads WMT datasets, so you no longer have to deal with SGML.

You can install it via pip:

    pip3 install sacrebleu

For starters, you can use it to easily download datasets:

    sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
    sacrebleu -t wmt17 -l en-de --echo ref > wmt17.en-de.de

You don’t need to download the reference, though; you can score against it using sacreBLEU directly. After decoding and detokenizing, try:

    cat output.detok.txt | sacrebleu -t wmt17 -l en-de

I have tested it, and it produces exactly the same numbers as Moses' mteval-v13a.pl, the official scoring script for WMT, for all 153 WMT17 system submissions (column BLEU-cased at matrix.statmt.org). For example:

    $ cat newstest2017.uedin-nmt.4722.en-de | sacrebleu -t wmt17 -l en-de
    BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.1.4 = 28.30 59.9/34.0/21.8/14.4 (BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)

This means numbers computed with it are directly comparable across papers. As you can see, in addition to the score, it outputs a version string that records the exact BLEU parameters used.
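[Editor's note] The output line above reports four n-gram precisions, a brevity penalty (BP), and the length ratio. To make those fields concrete, here is a minimal corpus-BLEU sketch. It is an illustration only, not sacreBLEU's implementation: it assumes pre-tokenized, whitespace-split input and omits both the 13a tokenization and the exponential smoothing recorded in the signature (tok.13a, smooth.exp).

```python
# Minimal corpus-BLEU sketch (illustration only, not sacreBLEU's code).
# Computes clipped n-gram precisions p_1..p_4, brevity penalty BP, and
# BLEU = 100 * BP * exp(mean(log p_n)) -- the fields in the output line above.
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU for parallel lists of hypothesis/reference strings."""
    hyp_len = ref_len = 0
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # total hypothesis n-grams, per order
    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter intersection clips each n-gram count at the reference count.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0.0:
        return 0.0  # unsmoothed: any zero precision zeroes the score
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Because this skips the reference tokenization step, its numbers will not match sacreBLEU's on real data; it only shows how the reported precisions, BP, and ratio combine into the final score.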
The output string is compatible with the output of multi-bleu.pl, so your old code for parsing the BLEU score out of multi-bleu.pl should still work.

You can also use the tool in a backward-compatible mode with arbitrary references, in the usual way:

    cat output.detok.txt | sacrebleu ref1 [ref2 …]

The official code is in sockeye (Amazon’s NMT system): github.com/awslabs/sockeye/tree/master/contrib/sacrebleu

I will also likely maintain a clone here: github.com/mjpost/sacreBLEU

matt

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
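[Editor's note] The claim above that old multi-bleu.pl parsing code should still work can be illustrated with a small regex. This is a hedged sketch based only on the sample output line quoted in the email, not an official sacreBLEU parser; the field layout is assumed from that one example.

```python
# Pull the score out of a sacreBLEU-style output line.
# Sketch based on the sample line quoted above; not an official parser.
import re

# The example output line from the email above.
LINE = ("BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17"
        "+tok.13a+version.1.1.4 = 28.30 59.9/34.0/21.8/14.4 "
        "(BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)")

def parse_bleu(line):
    """Return (signature, score, [p1, p2, p3, p4]) from a BLEU output line."""
    m = re.match(r"(\S+) = ([\d.]+) ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)", line)
    if m is None:
        raise ValueError("unrecognized BLEU output line")
    signature = m.group(1)                      # e.g. BLEU+case.mixed+...
    score = float(m.group(2))                   # corpus BLEU
    precisions = [float(m.group(i)) for i in range(3, 7)]  # p_1..p_4
    return signature, score, precisions
```

For the sample line, `parse_bleu(LINE)` yields the signature string, the score 28.30, and the four n-gram precisions. Note that classic multi-bleu.pl separates these fields with commas, so a parser aimed at both formats would need a slightly looser pattern.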
