Re: [Moses-support] BLEU evaluation at WMT - contractions

Marcin Junczys-Dowmunt Fri, 10 Oct 2014 14:33:52 -0700

Hi Ondrej,
during the metrics task, is your baseline BLEU score also calculated by 
the NIST script with all its consequences?


Marcin

On 10.10.2014 22:28, Ondrej Bojar wrote:
> Hi,
>
> I find only one good thing about the standard NIST script: that it is beyond 
> the control of any of us. ;-) It could be better only if the code were not 
> available. Then we'd have a truly black box measure. ;-)
>
> Yes, you're definitely right that mismatch in register should not be 
> penalized so much, but for WMT translation, we're actually not looking at 
> automatic scores at all. In some years, I think they were not even reported 
> in the paper. That's what the metric task is for: to promote metrics that 
> look at just the right things.
>
> Cheers, O.
>
> ----- Original Message -----
>> From: "Marcin Junczys-Dowmunt" <[email protected]>
>> To: "Philipp Koehn" <[email protected]>
>> Cc: [email protected]
>> Sent: Friday, 10 October, 2014 6:59:03 PM
>> Subject: Re: [Moses-support] BLEU evaluation at WMT - contractions
>>
>> Thanks for the quick answer.
>>
>> I admire the stoicism :) I find it painful to see that contractions are
>> not handled by the official script. You get two errors for not hitting
>> "we" and "are" when you have "we're" which is actually the same (modulo
>> style). Also, I guess the news domain has fewer issues with contractions;
>> otherwise you might have heard more complaints. Unfortunately, I have to
>> provide results in WMT-style, so there is no way around that script.
>> METEOR does it right by the way.
>>
>> On 10.10.2014 17:44, Philipp Koehn wrote:
>>> Hi,
>>>
>>> there are a lot of issues with tokenization.
>>>
>>> The BLEU scores we report in WMT are using the standard NIST script,
>>> which expects detokenized and properly cased output. The script does
>>> its own internal tokenization; we just accept that.
>>>
>>> Another way to compute BLEU scores is with multi-bleu.perl - which
>>> completely accepts your tokenization.
>>>
>>> -phi
>>>
>>>
>>> On Fri, Oct 10, 2014 at 11:12 AM, Marcin Junczys-Dowmunt
>>> <[email protected]> wrote:
>>>
>>>      Hi,
>>>
>>>      slightly off-topic: I have a question concerning the evaluation
>>>      practice during WMT. I have noticed that the standard NIST script
>>>      mteval-v13a.pl (or any other version)
>>>      does not split on apostrophes for English contractions. How was
>>>      this handled during the WMT? Did you use the official NIST scripts
>>>      for BLEU calculation after detokenization? If yes, this would
>>>      severely penalize the use of contractions over non-contracted
>>>      forms (around 2-3% BLEU). Is this just generally accepted?
>>>
>>>      Thanks,
>>>
>>>      Marcin
>>>
>>>
>>>      _______________________________________________
>>>      Moses-support mailing list
>>>      [email protected]
>>>      http://mailman.mit.edu/mailman/listinfo/moses-support
>>>
>>>
>>
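
The contraction penalty discussed in this thread can be illustrated with a
small sketch. It is not the NIST script: it only counts clipped n-gram matches
for one sentence pair, and it assumes (as described above) a tokenizer that
leaves "we're" as a single token. The example sentences and the helper names
`ngrams` and `clipped_matches` are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_matches(hyp, ref, n):
    """Return (matched, total) clipped n-gram counts for one sentence pair."""
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference (the clipping used in BLEU's modified precision).
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Assumption: whitespace splitting stands in for a tokenizer that does not
# split on apostrophes, so "we're" stays one token.
hypothesis = "we are looking at the scores".split()
reference = "we're looking at the scores".split()

for n in (1, 2):
    matched, total = clipped_matches(hypothesis, reference, n)
    print(f"{n}-gram precision: {matched}/{total}")
# 1-gram precision: 4/6
# 2-gram precision: 3/5
```

The two missed unigrams are "we" and "are", which is the double error
mentioned above; the bigrams that contain them are lost as well. With a
tokenizer that splits contractions consistently on both sides, or with
identical pre-tokenized input to multi-bleu.perl, the mismatch would be
smaller.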

