Hi Jonathan,

Thanks for your reply. I usually prefer the mteval script because it lets
me retain the original reference, and it also ensures consistency across
all the systems being evaluated. With this, I just detokenize the system
output and measure lower-cased BLEU for all systems.
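To be concrete, by lower-cased BLEU I just mean lowercasing both the
hypotheses and the references before scoring, roughly as in the sketch
below. NLTK's corpus_bleu is only a stand-in for mteval's internal BLEU
here (mteval additionally applies its own NormalizeText()), and the file
names are placeholders:

    # Sketch of lower-cased BLEU: lowercase both sides before scoring.
    # NLTK's corpus_bleu stands in for mteval's BLEU; file names are
    # placeholders, not real paths.
    from nltk.translate.bleu_score import corpus_bleu

    def lowercased_bleu(hyp_lines, ref_lines):
        hyps = [h.lower().split() for h in hyp_lines]
        refs = [[r.lower().split()] for r in ref_lines]  # one ref per segment
        return 100 * corpus_bleu(refs, hyps)             # as a percentage

    with open('system.detok') as h, open('reference.txt') as r:
        print(lowercased_bleu(h.read().splitlines(), r.read().splitlines()))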
Btw, I didn't understand this:

    ... It shouldn't be a huge job to do the same with Moses' bootstrap
    resampling script, extracting its normalization as a separate step. ...

As far as I know, the bootstrap resampling script doesn't do any
normalization. At least I don't see such a version in my Moses directory,
obtained from GitHub around June '12.

Anyway, the BLEU scores are below. These scores are on the Arabic-English
MTA test set of 1313 sentences. The system was trained on the Ar-En ISI
parallel corpus (~1.1M sentence pairs) and tuned on a separate MTA tuning
set of 1664 sentences. Scores marked with * indicate a statistically
significant difference from the baseline at p = 0.01. Please see the
footnotes below for what the bracketed numbers mean.

                                   Baseline   System-1
    mteval-11b [1]                  36.06      36.16
     - w/o normalization [2,3]      35.73      35.89
    multi-bleu.pl [2]               34.15      34.52
    bootstrap resampling [2]        34.15      34.52*
     - with normalization [1,4]     34.59      34.90*
    multeval (0.4.3) [2]            32.70      33.30*

    [1] Uses the original reference with detokenized and normalized system
        output
    [2] Uses the detokenized reference with raw system output
    [3] Disables the call to the NormalizeText() method in mteval-11b
    [4] Calls the NormalizeText() method (copied from mteval-11b) as a
        pre-processing step

Any suggestions? (I've also put a sketch of the paired bootstrap procedure
and of mteval's normalization in the postscripts at the very end of this
message, below the quoted thread.)

cheers
- Baskaran

On Tue, Apr 9, 2013 at 7:36 AM, Jonathan Clark <[email protected]> wrote:

> Hi Baskaran,
>
> I've had similar issues when dealing with metric scripts that perform
> their own normalization. As a first step, you might consider performing
> normalization as a pre-processing step and disabling all normalization
> within the scripts. Michael Denkowski has a version of mteval that has
> normalization disabled:
> https://github.com/mjdenkowski/meteor/tree/master/mt-diff/files. It
> shouldn't be a huge job to do the same with Moses' bootstrap resampling
> script, extracting its normalization as a separate step. This will at
> least allow you to examine the inputs and blame either text normalization
> or the mathematics. Selfishly, I'd also like to know whether multeval's
> bootstrap resampling differs in its calculations. :)
>
> Usually, I'm not a fan of doing any normalization besides the
> tokenization inherent to the MT system, but I know sometimes this isn't
> an option if you don't have control over one of the systems involved in
> the comparison.
>
> Could you also post absolute BLEU scores? Sometimes, smoothing can make a
> difference with lower-scoring systems.
>
> Cheers,
> Jon
>
>
> On Mon, Apr 8, 2013 at 9:13 PM, Baskaran Sankaran <[email protected]> wrote:
>
>> Hi group,
>>
>> I need to compute statistical significance between a pair of system
>> outputs, and I've used the bootstrap resampling script in Moses.
>> Unfortunately, the BLEU scores from this script differ substantially
>> (about 1.5 points lower) from those of the standard mteval script. I've
>> also tried applying the same text normalization routine from mteval in
>> the bootstrap resampling script (and modified the script a bit so that
>> it would normalize both hyps and refs), but the scores are still
>> different.
>>
>> The problem is that the Moses bootstrap script suggests that a system
>> output is statistically significantly better than the baseline (with an
>> absolute BLEU difference of 0.3), while the mteval BLEU difference
>> between those systems is only 0.1.
>>
>> I know multeval is an option, but again the scores are different, and it
>> doesn't do normalization. Any suggestions?
>>
>> Thanks
>> - Baskaran
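P.S. For anyone following along, a rough sketch of the paired bootstrap
procedure (Koehn, 2004) in plain Python is below. The metric argument is a
placeholder for a corpus-level BLEU scorer; Moses' script, mteval, and
multeval each supply their own, which is exactly where the discrepancy
seems to creep in:

    import random

    def paired_bootstrap(hyp_a, hyp_b, refs, metric, n_samples=1000, seed=12345):
        """Paired bootstrap resampling (Koehn, 2004) -- a sketch.

        hyp_a, hyp_b: hypothesis sentences from the two systems
        refs:         reference sentences, in the same order
        metric:       corpus-level scorer, (hypotheses, references) -> float
        Returns the fraction of resamples on which system B beats system A.
        """
        rng = random.Random(seed)
        n = len(refs)
        b_wins = 0
        for _ in range(n_samples):
            idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
            sample_a = [hyp_a[i] for i in idx]
            sample_b = [hyp_b[i] for i in idx]
            sample_r = [refs[i] for i in idx]
            if metric(sample_b, sample_r) > metric(sample_a, sample_r):
                b_wins += 1
        return b_wins / n_samples

If the returned fraction is at least 0.99, you would call the difference
significant at p = 0.01. Note that the resampling itself is the easy part:
whatever normalization applies happens inside metric(), so two
implementations can agree on the procedure and still disagree on the scores.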
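P.P.S. Re footnotes [3] and [4]: below is my rough Python paraphrase of the
language-independent part of NormalizeText() in mteval-11b. I'm
reconstructing the regexes from my reading of the Perl, so please check it
against the actual script before relying on it:

    import re

    def normalize_text(s):
        # Rough paraphrase of mteval's language-independent NormalizeText();
        # verify against the Perl source before relying on it.
        s = re.sub(r'<skipped>', '', s)   # strip <skipped> tags
        s = re.sub(r'-\n', '', s)         # undo end-of-line hyphenation
        s = re.sub(r'\n', ' ', s)         # join lines
        # un-escape a few SGML entities
        s = s.replace('&quot;', '"').replace('&amp;', '&')
        s = s.replace('&lt;', '<').replace('&gt;', '>')
        s = ' ' + s + ' '
        # put spaces around most punctuation
        s = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', r' \1 ', s)
        # split period/comma unless adjacent to a digit
        s = re.sub(r'([^0-9])([\.,])', r'\1 \2 ', s)
        s = re.sub(r'([\.,])([^0-9])', r' \1 \2', s)
        # split a dash preceded by a digit
        s = re.sub(r'([0-9])(-)', r'\1 \2 ', s)
        return re.sub(r'\s+', ' ', s).strip()  # squeeze whitespace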
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
