Re: [Moses-support] BLEU evaluation at WMT - contractions

Ondrej Bojar Fri, 10 Oct 2014 13:32:23 -0700

Hi,

I find only one good thing about the standard NIST script: that it is beyond 
the control of any of us. ;-) It could be better only if the code were not 
available. Then we'd have a truly black-box measure. ;-)


Yes, you're definitely right that mismatch in register should not be penalized 
so much, but for WMT translation, we're actually not looking at automatic 
scores at all. In some years, I think they were not even reported in the paper. 
That's what the metric task is for: to promote metrics that look at just the 
right things.

Cheers, O.

----- Original Message -----
> From: "Marcin Junczys-Dowmunt" <[email protected]>
> To: "Philipp Koehn" <[email protected]>
> Cc: [email protected]
> Sent: Friday, 10 October, 2014 6:59:03 PM
> Subject: Re: [Moses-support] BLEU evaluation at WMT - contractions
> 
> Thanks for the quick answer.
> 
> I admire the stoicism :) I find it painful to see that contractions are
> not handled by the official script. You get two errors for not hitting
> "we" and "are" when you have "we're" which is actually the same (modulo
> style).  Also, I guess the news domain has fewer issues with contractions;
> otherwise you might have heard more complaints. Unfortunately I have to
> provide results in WMT-style, so there is no way around that script.
> METEOR does it right by the way.
> 
> On 10.10.2014 17:44, Philipp Koehn wrote:
> > Hi,
> >
> > there are a lot of issues with tokenization.
> >
> > The BLEU scores we report in WMT are using the standard NIST script,
> > which expects detokenized and properly cased output. The script does
> > its own internal tokenization; we just accept that.
> >
> > Another way to compute BLEU scores is with multi-bleu.perl - which
> > completely accepts your tokenization.
> >
> > -phi
> >
> >
> > On Fri, Oct 10, 2014 at 11:12 AM, Marcin Junczys-Dowmunt
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> >     Hi,
> >
> >     slightly off-topic: I have a question concerning the evaluation
> >     practice during WMT. I have noticed that the standard NIST script
> >     mteval-v1.3a.pl (or any other version)
> >     does not split on apostrophes for English contractions. How was
> >     this handled during the WMT? Did you use the official NIST scripts
> >     for BLEU calculation after detokenization? If yes, this would
> >     severely penalize the use of contractions over non-contracted
> >     forms (around 2-3% BLEU), is this just generally accepted?
> >
> >     Thanks,
> >
> >     Marcin
> >
> >
> >     _______________________________________________
> >     Moses-support mailing list
> >     [email protected] <mailto:[email protected]>
> >     http://mailman.mit.edu/mailman/listinfo/moses-support
> >
> >
> 
> 
> 
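
For illustration only (not part of the original thread): a minimal Python
sketch of the effect Marcin describes above. It assumes plain whitespace
tokenization as a stand-in for any tokenizer that leaves the apostrophe
attached, so "we're" stays a single token. It is not the NIST script's actual
tokenization or scoring, and the helper names (ngrams, clipped_matches) and
the example sentences are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_matches(hyp, ref, n):
    """Return (matched, total) hypothesis n-grams against one reference.

    Each hypothesis n-gram is credited at most as many times as it
    occurs in the reference (the clipping used in BLEU's precision).
    """
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched, sum(hyp_counts.values())


# Hypothetical example sentences; the apostrophe is NOT split off.
reference = "we are looking at the scores".split()
contracted = "we're looking at the scores".split()
uncontracted = "we are looking at the scores".split()

for name, hyp in [("contracted", contracted), ("uncontracted", uncontracted)]:
    counts = {n: clipped_matches(hyp, reference, n) for n in (1, 2)}
    print(name, counts)
```

In this toy case the contracted hypothesis matches 4 of 5 unigrams and 3 of 4
bigrams, while the uncontracted one matches 6 of 6 and 5 of 5: the single
token "we're" matches neither "we" nor "are" in the reference, and every
higher-order n-gram that spans it is lost as well.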

-- 
Ondrej Bojar (mailto:[email protected] / [email protected])
http://www.cuni.cz/~obo
