Hi Jorg,

The short answer to your question is yes, the numbers you are reporting are reasonable. Intersection gets around 6% AER, and Och's refined heuristic gets around 10% AER on the training data set I worked on in the past, which is the LDC Hansard.
Here is the longer answer to the question you didn't ask :-)

1) AER is broken for Sure and Possible links and can be gamed by guessing fewer links. If you must use Sure vs. Possible alignments, use Och and Ney's definitions of Precision and Recall, and take 1 minus the geometric mean. (See our CL squib, kindly already cited by Adam, for more details.)

2) The gold standard alignment set is broken. (I assume we are talking about French/English, btw; I think there was also German/English, which I am not familiar with.) There are 4376 Sure links and 19222 Possible links. Franz told me this was generated by having two annotators each annotate the set: the intersection of the two annotations was marked Sure, and the union was marked Possible. So the inter-annotator agreement was really low. This was not done using a GUI, btw, but by typing in offsets.

3) Sure vs. Possible-and-not-Sure is a nebulous distinction (see above). If you would like the first 220 sentences of the set reannotated as Sure-only (in the spirit of Melamed's Blinker guidelines), I can make those available. They worked better for predicting MT performance.

4) The sentences annotated were sampled from the LDC Hansard, not the ISI Hansard; results using the ISI Hansard are not directly comparable. (The gold standard alignments are also mismatched in time; I don't know if this is important.)

5) There are French/English alignments available for Europarl; perhaps you should be using those instead? They use Sure vs. Possible, unfortunately. I don't know whether they had French or English native speakers, so YMMV. Not to criticize, though; I bet there are errors in my annotation as well. Many thanks to those guys for releasing their work!! https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments

6) I would use unbalanced F-measure rather than balanced F-measure (see again the squib; this is the main point of it).
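To make the arithmetic in points 1) and 6) concrete, here is a small Python sketch of the standard metrics. The function names and the link sets in the example are my own invented toy data, not taken from the Hansard gold standard, and this is just one way to write it down. It uses Och and Ney's definitions: precision is measured against the Possible set, recall only against the Sure set, and AER mixes the two; the alpha parameter in the F-measure is the unbalancing knob.

```python
# Toy sketch of word-alignment metrics (Och & Ney's definitions).
# Assumes links are (source_index, target_index) pairs and that the
# Sure set is a subset of the Possible set. All example data is invented.

def precision_recall(hypothesis, sure, possible):
    """Precision counts hits against Possible; recall only against Sure."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    return len(a & p) / len(a), len(a & s) / len(s)

def aer(hypothesis, sure, possible):
    """Alignment Error Rate: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|)."""
    a, s, p = set(hypothesis), set(sure), set(possible)
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

def f_alpha(hypothesis, sure, possible, alpha=0.5):
    """Unbalanced F-measure: 1 / (alpha/precision + (1-alpha)/recall).
    alpha = 0.5 gives the usual balanced (harmonic-mean) F-measure;
    alpha > 0.5 weights precision more. 1 - f_alpha(...) is an error rate."""
    prec, rec = precision_recall(hypothesis, sure, possible)
    return 1.0 / (alpha / prec + (1.0 - alpha) / rec)

if __name__ == "__main__":
    sure = {(0, 0), (1, 1)}
    possible = sure | {(2, 2)}  # Possible always includes all Sure links
    guess = {(0, 0), (1, 1), (2, 2), (3, 3)}  # one spurious link (3, 3)
    print("AER:", aer(guess, sure, possible))
    print("F(alpha=0.7):", f_alpha(guess, sure, possible, alpha=0.7))
```

Playing with link sets like these is also an easy way to see the gaming problem from point 1) for yourself: try dropping links from the hypothesis and watch what AER does.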
For applications where precision is more important (such as cross-lingual retrieval), increase alpha to weight precision more.

Cheers,
Alex

---
Alexander Fraser
Institute for Natural Language Processing
University of Stuttgart
Azenbergstrasse 12
70174 Stuttgart, Germany
phone: +49 (711) 685-81375
fax: +49 (711) 685-71400
email: [email protected]
web: http://www.ims.uni-stuttgart.de/~fraser

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
