Hi! Thanks for all the replies, and thanks for the interesting paper on AER and its relation to BLEU scores. Quite embarrassing that I haven't seen it before.
Alexander: I would be interested in your gold standard data; it would be nice if you could make it available. Word alignment evaluation remains a tricky business. What would be the best way to compare results with previously reported experiments? Most people have used AER, as you also mention in your paper. From your discussion I conclude that for English-French an F-measure with alpha=0.4 would be a good setting. (To be sure: you mean the harmonic mean and not the geometric mean, right?) But what would be the right thing to do to compare results on standard sets?

By the way, are there any other studies on the influence of word alignment quality for purposes other than standard SMT? I was again thinking of approaches like Hiero, SAMT, maybe tree alignment and other types of transfer rule extraction, annotation/grammar projection, bilingual lexicon/terminology extraction, etc. I'm just curious.

Cheers,
Jorg

> Here is the longer answer to the question you didn't ask :-)
>
> 1) AER is broken for Sure and Possible links and can be gamed by
> guessing fewer links. If you must use Sure vs. Possible alignments,
> use Och and Ney's definition of Precision and Recall, and take 1 - the
> geometric mean. (See our CL squib, kindly already cited by Adam, for
> more details.)
>
> 2) The gold standard alignment set is broken (I assume we are talking
> about French/English btw; I think there was also German/English, which
> I am not familiar with). There are 4376 Sure links and 19222 Possible
> links. Franz told me that the way this was generated is that two
> annotators both annotated the set. The intersection of the annotators
> was marked Sure, and the union of the annotators was marked Possible.
> So the inter-annotator agreement was really low. This was not done
> using a GUI, btw, but instead by typing in offsets.
>
> 3) Sure vs. Possible_and_not_Sure is a nebulous distinction (see
> above).
> If you would like the first 220 sentences of the set
> reannotated as Sure only (in the spirit of Melamed's Blinker
> guidelines), I can make those available. They worked better for
> predicting MT performance.
>
> 4) The sentences annotated were sampled from the LDC Hansard, not the
> ISI Hansard; results using the ISI Hansard are not directly comparable
> (the gold standard alignments are also mismatched in time; I don't
> know if this is important).
>
> 5) There are French/English alignments available for Europarl; perhaps
> you should be using these instead? They use Sure vs. Possible,
> unfortunately. I don't know if they had French or English native
> speakers, so YMMV. Not to criticize, though; I bet there are errors in
> my annotation as well. Many thanks to those guys for releasing their
> work!!
>
> https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments
>
> 6) I would use unbalanced F-measure rather than balanced F-measure
> (see again the squib; this is the main point of it). For applications
> where precision is more important (such as cross-lingual retrieval),
> increase alpha to weight precision more.
>
> Cheers, Alex
> ---
> Alexander Fraser
> Institute for Natural Language Processing
> University of Stuttgart
> Azenbergstrasse 12
> 70174 Stuttgart, Germany
>
> phone: +49 (711) 685-81375
> fax: +49 (711) 685-71400
> email: [email protected]
> web: http://www.ims.uni-stuttgart.de/~fraser
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
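For concreteness, here is a minimal sketch of the metrics under discussion: Och and Ney's Sure/Possible precision and recall, AER, and the alpha-weighted F-measure (the weighted harmonic mean of precision and recall). The function name and the toy link sets in the usage note are illustrative assumptions, not taken from any released implementation:

```python
def alignment_metrics(hyp, sure, possible, alpha=0.4):
    """Compute Och & Ney's Sure/Possible precision, recall, and AER,
    plus an alpha-weighted F-measure (weighted harmonic mean).

    All arguments are sets of (source_index, target_index) link pairs;
    `possible` is assumed to be a superset of `sure`.
    """
    a_and_s = hyp & sure        # hypothesis links that are Sure
    a_and_p = hyp & possible    # hypothesis links that are Possible

    precision = len(a_and_p) / len(hyp)   # |A ∩ P| / |A|
    recall = len(a_and_s) / len(sure)     # |A ∩ S| / |S|

    # AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)
    aer = 1 - (len(a_and_s) + len(a_and_p)) / (len(hyp) + len(sure))

    # Weighted harmonic mean of precision and recall; raising alpha
    # puts more weight on precision, as suggested for precision-
    # sensitive applications like cross-lingual retrieval.
    f_alpha = 1 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, aer, f_alpha
```

For example, with hypothesis links {(0,0), (2,2), (3,3)}, Sure links {(0,0), (1,1)}, and Possible links {(0,0), (1,1), (2,2)}, precision is 2/3, recall is 1/2, and AER is 0.4; note that a system can lower its AER simply by guessing fewer links, which is the gaming problem mentioned in point 1.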
