Hi!

Thanks for all the replies, and thanks for the interesting paper on
AER and its relation to BLEU scores. Quite embarrassing that I hadn't
seen it before.

Alexander: I would be interested in your gold standard data. It would
be nice if you could make it available.

Word alignment evaluation remains a tricky business. What would be the
best way to compare results with previously reported experiments? Most
people used AER, as you also mention in your paper. From your
discussion I conclude that for English-French an F-measure with
alpha=0.4 would be a good setting. (To be sure: you mean the harmonic
mean and not the geometric mean, right?) But what would be the right
way to compare results on standard sets?
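To make sure I understand the measure being discussed: as I read it, the alpha-weighted F-measure is the weighted harmonic mean of precision and recall. Here is a small sketch of my own (not code from the paper, and the example alignment links are made up) to illustrate:

```python
def f_alpha(precision, recall, alpha):
    """Alpha-weighted F-measure: the weighted harmonic mean of P and R.

    alpha weights precision; alpha = 0.5 recovers the balanced F1.
    Larger alpha favours precision, smaller alpha favours recall.
    """
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Hypothetical example: a candidate alignment vs. a Sure-only gold
# standard, with links given as (source_index, target_index) pairs.
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
hyp = {(0, 0), (1, 1), (2, 3)}

precision = len(hyp & gold) / len(hyp)   # 2/3
recall = len(hyp & gold) / len(gold)     # 2/4

print(f_alpha(precision, recall, 0.4))   # alpha < 0.5 weights recall more
print(f_alpha(precision, recall, 0.5))   # balanced F1 for comparison
```

With alpha=0.4 the score here comes out slightly above the balanced F1, since recall (0.5) is below precision (2/3) and gets less weight than at alpha=0.5 it would penalise.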

By the way, are there any other studies on the influence of word
alignment quality on purposes other than standard SMT? I was again
thinking of approaches like Hiero, SAMT, maybe tree alignment and
other types of transfer rule extraction, annotation/grammar
projection, bilingual lexicon/terminology extraction, etc.

I'm just curious.
cheers,

jorg

> 
> Here is the longer answer to the question you didn't ask :-)
> 
> 1) AER is broken for Sure and Possible links and can be gamed by
> guessing fewer links. If you must use Sure vs. Possible alignments,
> use Och and Ney's definition of Precision and Recall, and take 1 - the
> geometric mean. (See our CL squib, kindly already cited by Adam, for
> more details).
> 
> 2) The gold standard alignment set is broken (I assume we are talking
> about French/English btw, I think there was also German/English which
> I am not familiar with). There are 4376 Sure links and 19222 Possible
> links. Franz told me that the way this was generated is that two
> annotators both annotated the set. Intersection of the annotators was
> marked Sure, and union of the annotators was marked Possible. So the
> interannotator agreement was really low. This was not done using a
> GUI, btw, but instead by typing in offsets.
> 
> 3) Sure vs. Possible_and_not_Sure is a nebulous distinction (see
> above). If you would like the first 220 sentences of the set
> reannotated as Sure only (in the spirit of Melamed's Blinker
> guidelines), I can make those available. They worked better for
> predicting MT performance.
> 
> 4) The sentences annotated were sampled from the LDC Hansard, not the
> ISI Hansard; results using the ISI Hansard are not directly comparable
> (the gold standard alignments are also mismatched in time, I don't
> know if this is important).
> 
> 5) There are French/English alignments available for Europarl, perhaps
> you should be using these instead? They use Sure vs. Possible
> unfortunately. I don't know if they had French or English native
> speakers, so YMMV. Not to criticize though, I bet there are errors in
> my annotation as well. Many thanks to those guys for releasing their
> work!!
> 
> https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments
> 
> 6) I would use unbalanced F-Measure rather than balanced F-Measure
> (see again the squib, this is the main point of it). For applications
> where precision is more important (such as cross-lingual retrieval),
> increase alpha to weight precision more.
> 
> Cheers, Alex
> ---
> Alexander Fraser
> Institute for Natural Language Processing
> University of Stuttgart
> Azenbergstrasse 12
> 70174 Stuttgart, Germany
> 
> phone: +49 (711) 685-81375
> fax:   +49 (711) 685-71400
> email: [email protected]
> web:   http://www.ims.uni-stuttgart.de/~fraser
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
