I am having problems using the tagged and annotated output for the dev/corpus 
files, specifically news-test2008, when the text contains sentence-internal 
periods (.). If a period is truly an end-of-sentence marker, both TreeTagger 
and BitPar interpret it correctly. If it marks an abbreviation, however, 
TreeTagger treats it as part of the abbreviation, but BitPar misinterprets it 
as an end-of-sentence marker, and the two outputs fall out of sync.

For example, BitPar treats each of the following as two sentences:

# Am 9. Dezember
# wie z.B. die geringe Fahrpraxis

It is trivial to write a Perl script that changes the intermediate periods, 
for example turning "Am 9. Dezember" into "Am 9., Dezember". My questions 
are: what would be the best substitution (and is this the right approach in 
the first place), and what ramifications would it have for tuning, given that 
this file is being used as a tuning corpus?
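
To make the substitution concrete, here is a minimal sketch of what I have in 
mind, written in Python rather than Perl. It assumes the file has one sentence 
per line, so that any period followed by whitespace and more text on the same 
line is sentence-internal. The function name mark_internal_periods and the 
default replacement ".," are only illustrative; whether this is the right 
substitution is exactly the open question.

```python
import re

# Assumption: input is one sentence per line, so a period that is followed by
# whitespace and further text on the same line cannot be sentence-final.
INTERNAL_PERIOD = re.compile(r"\.(?=\s+\S)")


def mark_internal_periods(line, replacement=".,"):
    """Replace sentence-internal periods in a single line.

    Periods at the end of the line (optionally followed by trailing
    whitespace) are left untouched. The replacement string is hypothetical
    and should be chosen to suit the parser.
    """
    # A function is used as the replacement so that the string is inserted
    # literally, without backslash-escape processing.
    return INTERNAL_PERIOD.sub(lambda match: replacement, line)


if __name__ == "__main__":
    for sentence in ["Am 9. Dezember", "wie z.B. die geringe Fahrpraxis"]:
        print(mark_internal_periods(sentence))
```

Note that this only touches periods followed by whitespace, so the first 
period in "z.B." is left alone, and any line that really does contain two 
sentences would be altered as well.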

Thank you!

_______________________________________________
Moses-support mailing list
http://mailman.mit.edu/mailman/listinfo/moses-support
