Hi Jorg,

The short answer to your question is yes, the numbers you are
reporting are reasonable. Intersection gets around 6% AER, and Och's
refined heuristic gets around 10% AER on the training data set I
worked with in the past, which is the LDC Hansard.

Here is the longer answer to the question you didn't ask :-)

1) AER is broken for Sure and Possible links and can be gamed by
guessing fewer links. If you must use Sure vs. Possible alignments,
use Och and Ney's definitions of Precision and Recall, and take 1
minus the harmonic mean (i.e., 1 minus the F-Measure). (See our CL
squib, kindly already cited by Adam, for more details.)
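The gaming problem can be seen on toy data. A minimal sketch, with invented link sets (the metric definitions are Och and Ney's standard ones; everything else here is made up for illustration):

```python
# Och & Ney's alignment metrics on invented toy link sets, to show
# how AER rewards a system that guesses fewer links.

def alignment_metrics(hyp, sure, possible):
    """hyp: hypothesized links; sure/possible: gold annotation (sure is a subset of possible)."""
    precision = len(hyp & possible) / len(hyp)
    recall = len(hyp & sure) / len(sure)
    aer = 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
    return precision, recall, aer

sure = {(0, 0), (1, 1)}
possible = sure | {(2, 2), (3, 3)}

# "full" attempts the hard Possible links too and makes one error;
# "timid" guesses only the easy Sure links.
full = {(0, 0), (1, 1), (2, 2), (3, 1)}
timid = {(0, 0), (1, 1)}

for name, hyp in [("full", full), ("timid", timid)]:
    p, r, a = alignment_metrics(hyp, sure, possible)
    print(f"{name}: P={p:.2f} R={r:.2f} AER={a:.2f}")
# timid gets a perfect AER of 0.00 despite proposing no Possible links,
# while full is penalized (AER around 0.17) -- recall over the Possible
# set is never measured, so withholding links can only help.
```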

2) The gold standard alignment set is broken (I assume we are talking
about French/English, btw; I think there was also a German/English set,
which I am not familiar with). There are 4376 Sure links and 19222
Possible links. Franz told me that this was generated by having two
annotators each annotate the set: links in the intersection of the two
annotations were marked Sure, and links in the union were marked
Possible. So with over four times as many Possible links as Sure links,
the interannotator agreement was really low. This was not done using a
GUI, btw, but instead by typing in offsets.

3) Sure vs. Possible_and_not_Sure is a nebulous distinction (see
above). If you would like the first 220 sentences of the set
reannotated as Sure only (in the spirit of Melamed's Blinker
guidelines), I can make those available. They worked better for
predicting MT performance.

4) The sentences annotated were sampled from the LDC Hansard, not the
ISI Hansard; results using the ISI Hansard are not directly comparable
(the gold standard alignments are also mismatched in time; I don't
know whether this is important).

5) There are French/English alignments available for Europarl; perhaps
you should be using these instead? Unfortunately, they also use Sure
vs. Possible. I don't know whether they had French or English native
speakers, so YMMV. Not to criticize, though; I bet there are errors in
my annotation as well. Many thanks to those guys for releasing their
work!!

https://www.l2f.inesc-id.pt/wiki/index.php/Word_Alignments

6) I would use unbalanced F-Measure rather than balanced F-Measure
(see the squib again; this is its main point). For applications where
precision is more important (such as cross-lingual retrieval),
increase alpha to weight precision more heavily.
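For concreteness, the alpha-weighted F-Measure from the squib can be sketched like this (the precision/recall values below are invented for illustration):

```python
def f_alpha(precision, recall, alpha):
    """Unbalanced F-Measure: 1 / (alpha/P + (1-alpha)/R).
    alpha = 0.5 recovers the balanced (harmonic-mean) F-Measure;
    alpha > 0.5 weights precision more heavily."""
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# A high-precision, low-recall alignment (values invented):
p, r = 0.9, 0.6
print(f_alpha(p, r, 0.5))  # balanced F-Measure at alpha = 0.5
print(f_alpha(p, r, 0.7))  # higher score: alpha = 0.7 rewards the precision
```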

Cheers, Alex
---
Alexander Fraser
Institute for Natural Language Processing
University of Stuttgart
Azenbergstrasse 12
70174 Stuttgart, Germany

phone: +49 (711) 685-81375
fax:   +49 (711) 685-71400
email: [email protected]
web:   http://www.ims.uni-stuttgart.de/~fraser
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
