Hi, yes, I could add this easily. There are currently "wmt16/B" and "wmt17/B" test sets that include just the second reference. Do you anticipate using *just* the second reference? If so, I can create new "wmt16/2" and "wmt17/2" test sets that use both references. If you don't care about using just the second reference, I will repurpose the "*/B" sets to use both.
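
To make the difference between "just the second reference" and "both references" concrete, here is a minimal sketch of how BLEU's clipped n-gram counts change when a second reference is added: each hypothesis n-gram count is clipped at the largest count found in any single reference. This is not sacreBLEU's code. The function names and the toy sentences are made up for illustration, and tokenization is plain whitespace splitting.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hyp, refs, n):
    """Return (matched, total) n-gram counts for one hypothesis.

    Each hypothesis n-gram count is clipped at the maximum number of
    times that n-gram occurs in any single reference.
    """
    hyp_counts = ngrams(hyp, n)
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Toy data (hypothetical, not taken from any WMT test set).
hyp = "the cat sat on the mat".split()
ref_a = "the cat is on the mat".split()
ref_b = "a cat sat on a mat".split()

only_a = clipped_matches(hyp, [ref_a], 2)        # first reference only
only_b = clipped_matches(hyp, [ref_b], 2)        # second reference only
both = clipped_matches(hyp, [ref_a, ref_b], 2)   # both references

print(only_a, only_b, both)
```

With these toy sentences the bigram matches are 3 of 5 against the first reference, 2 of 5 against the second, and 5 of 5 against both, which is why a two-reference test set generally gives higher scores than either single-reference one.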
matt

> On Nov 12, 2017, at 10:47 AM, Jörg Tiedemann <[email protected]> wrote:
>
> This is nice! Could your tool even support an option that makes use of the
> multi-reference test sets that are available for English-Finnish in 2016 and
> 2017? They would finally be used for something if there would be a simple
> option that allows to download and use those sets for standard evaluation.
> Thanks!
>
> Jörg
>
> **********************************************************************************************
> Jörg Tiedemann
> Department of Modern Languages    http://blogs.helsinki.fi/tiedeman/
> University of Helsinki            http://blogs.helsinki.fi/language-technology/
>
>> On 11 Nov 2017, at 12:37, Matt Post <[email protected]> wrote:
>>
>> Hi,
>>
>> I’ve written a BLEU scoring tool called “sacreBLEU” that may be of use to
>> people here. The goal is to get people to start reporting WMT-matrix
>> compatible scores in their papers (i.e., scoring on detokenized outputs with
>> a fixed reference tokenization) so that numbers can be compared directly, in
>> the spirit of Rico Sennrich's multi-bleu-detok.pl. The nice part for you is
>> that it auto-downloads WMT datasets and makes it so you no longer have to
>> deal with SGML. You can install it via pip:
>>
>>     pip3 install sacrebleu
>>
>> For starters, you can use it to easily download datasets:
>>
>>     sacrebleu -t wmt17 -l en-de --echo src > wmt17.en-de.en
>>     sacrebleu -t wmt17 -l en-de --echo ref > wmt17.en-de.de
>>
>> You don’t need to download the reference, though. You can just score against
>> it using sacreBLEU directly. After decoding and detokenizing, try:
>>
>>     cat output.detok.txt | sacrebleu -t wmt17 -l en-de
>>
>> I have tested and it produces the exact same numbers as Moses'
>> mteval-v13a.pl, which is the official scoring script for WMT. It computes
>> the exact same numbers for all 153 WMT17 system submissions (column
>> BLEU-cased at matrix.statmt.org). For example:
>>
>>     $ cat newstest2017.uedin-nmt.4722.en-de | sacrebleu -t wmt17 -l en-de
>>
>>     BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.1.4 = 28.30 59.9/34.0/21.8/14.4 (BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)
>>
>> This means numbers computed with it are directly comparable across papers.
>> As you can see, in addition to the score, it outputs a version string that
>> records the exact BLEU parameters used. The output string is compatible with
>> the output of multi-bleu.pl, so your old code for parsing the BLEU score out
>> of multi-bleu.pl should still work.
>>
>> You can also use the tool in a backward compatible mode with arbitrary
>> references, the same way
>>
>>     cat output.detok.txt | sacrebleu ref1 [ref2 …]
>>
>> The official code is in sockeye (Amazon’s NMT system):
>>
>>     github.com/awslabs/sockeye/tree/master/contrib/sacrebleu
>>
>> I will also likely maintain a clone here:
>>
>>     github.com/mjpost/sacreBLEU
>>
>> matt
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>
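
Since the quoted message says the score line can be parsed the same way as multi-bleu.pl output, here is a rough sketch of pulling the score and the version-string fields out of the example line quoted above. It assumes the line arrives unwrapped on a single line, in exactly the layout shown in that example. The pattern and the name `parse_score_line` are illustrative only and are not part of sacreBLEU.

```python
import re

# The example score line from the quoted message, joined onto one line.
line = (
    "BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt17+tok.13a+version.1.1.4 "
    "= 28.30 59.9/34.0/21.8/14.4 "
    "(BP = 1.000 ratio = 1.026 hyp_len = 62873 ref_len = 61287)"
)

# Assumed layout: <signature> = <bleu> <p1/p2/...> (BP = .. ratio = .. hyp_len = .. ref_len = ..)
PATTERN = re.compile(
    r"^(?P<signature>\S+) = (?P<bleu>[\d.]+) "
    r"(?P<precisions>[\d.]+(?:/[\d.]+)*) "
    r"\(BP = (?P<bp>[\d.]+) ratio = (?P<ratio>[\d.]+) "
    r"hyp_len = (?P<hyp_len>\d+) ref_len = (?P<ref_len>\d+)\)$"
)


def parse_score_line(text):
    """Parse one score line into a dict; raise ValueError if it does not match."""
    match = PATTERN.match(text.strip())
    if match is None:
        raise ValueError("unrecognized score line: %r" % text)
    # The signature is '+'-separated; every field after the metric name
    # is 'key.value', split on the first dot only (so version 1.1.4 survives).
    metric, *fields = match.group("signature").split("+")
    params = dict(field.split(".", 1) for field in fields)
    return {
        "metric": metric,
        "params": params,
        "bleu": float(match.group("bleu")),
        "precisions": [float(p) for p in match.group("precisions").split("/")],
        "bp": float(match.group("bp")),
        "ratio": float(match.group("ratio")),
        "hyp_len": int(match.group("hyp_len")),
        "ref_len": int(match.group("ref_len")),
    }


result = parse_score_line(line)
print(result["bleu"], result["params"]["tok"], result["params"]["version"])
```

Keeping the parsed parameters next to the score makes it easy to check that two reported numbers used the same tokenization, test set, and number of references before comparing them.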
