Re: [Moses-support] German tokenizer may fail with numeric endings

Ergun Bicici Tue, 06 Nov 2018 22:43:16 -0800

The funny part is trying all of 1-99 :)

"The prefix is actually a suffix of the sentence": this need not be true,
since there can be itemized lists, e.g. "1. one microsoft way from 9 to 1."
Such sentences are frequent in Europarl.
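
To make the behaviour under discussion concrete, here is a minimal Python
sketch of a nonbreaking-prefix rule with an optional sentence-final
exception. It is a toy whitespace tokenizer, not the Moses implementation:
the function name `tokenize`, the `split_final` flag, and the two prefix
sets are assumptions made up for illustration.

```python
def tokenize(sentence, prefixes, split_final=False):
    """Toy tokenizer: split a trailing dot off each word unless the
    word (without the dot) is listed as a nonbreaking prefix.

    prefixes    -- set of strings that keep their trailing dot (assumed input)
    split_final -- hypothetical switch: always split the dot off the
                   last word of the sentence, even if it is a prefix
    """
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            keep = stem in prefixes and not (split_final and is_last)
            if keep:
                out.append(word)          # dot stays attached
            else:
                out.extend([stem, "."])   # dot becomes its own token
        else:
            out.append(word)
    return out


# Hypothetical prefix sets, only for this example.
FR_PREFIXES = {"sec"}
NUM_PREFIXES = {str(n) for n in range(1, 100)}

print(tokenize("Le vin est sec.", FR_PREFIXES))
print(tokenize("Le vin est sec.", FR_PREFIXES, split_final=True))
print(tokenize("1. one microsoft way from 9 to 1.", NUM_PREFIXES))
print(tokenize("1. one microsoft way from 9 to 1.", NUM_PREFIXES, split_final=True))
```

Without the exception, the final "sec." and the final "1." keep their dot.
With it, the final dot is split off while the leading "1." stays intact.
Note that in this sketch a line consisting only of "1." would also be split,
since the item number is then the last word.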


On Wed, Nov 7, 2018 at 1:46 AM Ozan Çağlayan <[email protected]> wrote:

> Yes, the rules come from the nonbreaking_prefixes files, which are text
> files listing the prefixes that, when followed by a <dot>, should not be
> split from it. But I think this rule should not be applied if the prefix is
> actually a suffix of the sentence. Similar situations arise for French and
> other languages as well. For French, "sec." is a non-breaking prefix, the
> abbreviation for "seconds", but "sec" also means "dry". So if a sentence
> ends with the "dry" meaning of "sec.", the <dot> is not split off either.
>
> As the size of the corpus goes to infinity, this means that all
> nonbreaking_prefixes for a language will end up, with the <dot> attached,
> in the model vocabulary for NMT.
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


-- 

Regards,
Ergun

