Re: [Moses-support] METEOR: difference between ranking task and other tasks

Marcin Junczys-Dowmunt Wed, 26 Nov 2014 13:49:18 -0800

Thanks, that's a very useful answer. I figured something similar, but I was curious why these huge differences between the methods are never reported anywhere. Even in your paper they are just a few percent.

Also, could it be that the default METEOR setting is slightly overfitting to the WMT ranking task? I have the impression that for systems with generally higher BLEU scores than WMT systems (beyond 45% BLEU), METEOR seems to flatten out, barely changing, while BLEU differences are 4-6% absolute. This does not happen for BLEU values around 20-30%; in that range METEOR scales nearly linearly, following BLEU scores quite closely.

Cheers,
Marcin


On 26.11.2014 at 22:31, Michael Denkowski wrote:

Hi Marcin,
Meteor scores can vary widely across tasks due to the training data and goal. The default ranking task tries to replicate WMT rankings, so the absolute scores are not as important as the relative scores between systems. The adequacy task tries to fit Meteor scores to numeric adequacy judgements as linearly as possible. If you're looking to evaluate a system in isolation to see if the translations are "good", you can simulate an adequacy scale with the "adq" task. If you're comparing multiple systems, you should get the most reliable ranking with the default "rank" task, but the absolute scores will be less meaningful.
Best,
Michael
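
To make the two use cases above concrete, here is a minimal Python sketch that builds one Meteor command line per task. It assumes the `-l` (language) and `-t` (task) options described in the Meteor README; the file names `system.out` and `reference.txt` and the helper `meteor_command` are hypothetical and only serve as illustration.

```python
import shlex


def meteor_command(jar, hypothesis, reference, task="rank", lang="en"):
    """Build the argument list for a single Meteor run.

    Assumes the usual `java -jar meteor.jar <test> <reference> -l <lang> -t <task>`
    form of invocation; check the README of your Meteor version.
    """
    return ["java", "-Xmx2G", "-jar", jar, hypothesis, reference,
            "-l", lang, "-t", task]


# "rank" for comparing several systems, "adq" for judging one system in isolation.
commands = {
    task: meteor_command("meteor-1.5.jar", "system.out", "reference.txt", task=task)
    for task in ("rank", "adq")
}

for task, cmd in commands.items():
    # Print the shell-ready command instead of running it.
    print(task, "->", shlex.join(cmd))
```

The lists in `commands` can be handed to `subprocess.run` once the jar and input files exist. Scores produced under different tasks are on different scales, so only compare numbers obtained with the same `-t` value.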
On Wed, Nov 26, 2014 at 9:34 AM, Marcin Junczys-Dowmunt <[email protected]> wrote:
    Hi,

    A question concerning METEOR, maybe someone has some experience. I
    am seeing huge differences between values for English with the
    default task "ranking" and any of the other tasks (e.g. "adq"), up
    to 30-40 points. Is this normal? In the literature I only ever see
    marginal differences of maybe 1 or 2 percent, but nothing like 35%
    vs. 65%. For the language-independent setting I still get a score
    of 55%.

    See for instance
    http://www.cs.cmu.edu/~alavie/METEOR/pdf/meteor-wmt11.pdf
    where the Urdu-English system shows much smaller differences between
    "ranking" and "adq". I get the same discrepancies with
    meteor-1.3.jar and meteor-1.5.jar.

    Cheers,

    Marcin


    _______________________________________________
    Moses-support mailing list
    [email protected]
    http://mailman.mit.edu/mailman/listinfo/moses-support

