Re: [Moses-support] German tokenizer may fail with numeric endings

Ergun Bicici Tue, 06 Nov 2018 13:52:01 -0800

There might be some rule that prevents the split. The scripts contain
language-specific tokenization rules, and they are checked in sequence.


Did you try all 1-99? :)
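
For what it's worth, here is a minimal Python sketch of how a prefix rule of this kind could produce the behaviour reported below. It is a simplified model, not the actual tokenizer.perl logic: the names ORDINAL_PREFIXES and split_final_period are hypothetical, and the assumption that the German list holds exactly the ordinals 1 to 99 comes from the report itself.

```python
# Simplified model of a nonbreaking-prefix check; NOT the real tokenizer.perl code.
# Assumption (from the report): the German prefix list contains ordinals "1".."99".
ORDINAL_PREFIXES = {str(n) for n in range(1, 100)}


def split_final_period(sentence, prefixes=ORDINAL_PREFIXES):
    """Detach a sentence-final period unless the token before it is a listed prefix."""
    words = sentence.split()
    # Nothing to do for empty input, a missing final period, or a bare "." token.
    if not words or not words[-1].endswith(".") or words[-1] == ".":
        return " ".join(words)
    stem = words[-1][:-1]
    if stem in prefixes:
        # Treated as an ordinal such as "3.", so the period stays attached.
        return " ".join(words)
    words[-1] = stem + " ."
    return " ".join(words)


if __name__ == "__main__":
    # Probe both sides of the 99/100 boundary.
    for n in (3, 99, 100):
        print(split_final_period(f"die Änderungsanträge 2 und {n}."))
```

In this model the period stays attached for 3 and 99 and is split off for 100, which matches the two outputs in the report. Running the same loop over 1-120 against the real tokenizer would show whether the boundary really sits at 99.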

On Mon, Nov 5, 2018 at 9:15 PM Ozan Çağlayan <[email protected]> wrote:

> Hello,
>
> I just discovered that the German tokenizer does not split the final <dot>
> if it is preceded by a number. This is because the nonbreaking prefixes
> file lists ordinals in the form '<number>.'. Since the list only covers
> 1-99, the tokenizer works correctly for numbers > 99. Here's a sentence
> from europarl:
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *3.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *3.*
>
> $ echo 'Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll
> die Änderungsanträge 2 und *100.*' | tokenizer.perl -q -l de
> Sie akzeptiert im Prinzip die Änderungsanträge 5 und 6 und voll die
> Änderungsanträge 2 und *100 .*
>
>
>
> --
> Ozan Caglayan
> PhD student @ University of Le Mans
> Team LST -- Language and Speech Technology
> http://www.ozancaglayan.com
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun

