[Moses-support] German tokenizer may fail with numeric endings

Ozan Çağlayan Mon, 05 Nov 2018 10:18:02 -0800

Hello,

I just discovered that the German tokenizer does not split the sentence-final
<dot> when it is preceded by a number. This is caused by the nonbreaking
prefixes file, which lists ordinals in the form '<number>.'. Since the list
only covers 1-99, the tokenizer works correctly for numbers > 99. Here's a
sentence from europarl:


$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
Änderungsanträge 2 und *3.*' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
Änderungsanträge 2 und *3.*

$ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
Änderungsanträge 2 und *100.*' | tokenizer.perl -q -l de
Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
Änderungsanträge 2 und *100 .*
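
The two outputs above can be reproduced with a small Python sketch. This is only a simplified model of the behaviour, not the actual tokenizer.perl logic. It assumes that the German nonbreaking prefixes list contains the ordinals 1 to 99 and that a period is left attached to any token found in that list. The names NONBREAKING_PREFIXES and split_final_period are made up for illustration.

```python
# Simplified model of the reported behaviour; NOT the actual tokenizer.perl logic.
# Assumption: the German nonbreaking-prefix list contains the ordinals "1" to "99".
NONBREAKING_PREFIXES = {str(n) for n in range(1, 100)}


def split_final_period(sentence):
    """Detach a trailing period from each token unless the token is a listed prefix."""
    out = []
    for token in sentence.split():
        stem = token[:-1]
        if token.endswith(".") and stem and stem not in NONBREAKING_PREFIXES:
            # Not a known prefix: treat the period as a separate token.
            out.append(stem + " .")
        else:
            # Listed prefix (e.g. an ordinal such as "3.") or no trailing period.
            out.append(token)
    return " ".join(out)


print(split_final_period("Änderungsanträge 2 und 3."))    # period stays attached
print(split_final_period("Änderungsanträge 2 und 100."))  # period is split off
```

In this model the sentence-final "3." is treated like the ordinal "3." because the lookup does not take the token's position in the sentence into account, whereas "100" is not in the list and so its period is split off.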



-- 
Ozan Caglayan
PhD student @ University of Le Mans
Team LST -- Language and Speech Technology
http://www.ozancaglayan.com

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
