The funny part is trying all of 1-99 :)

On "the prefix is actually a suffix of the sentence": this need not be true, since there can be itemized lists, e.g. "1. one microsoft way from 9 to 1." Such sentences are found frequently in Europarl.
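To make the case concrete, here is a minimal sketch of the rule under discussion. It is not the actual Moses tokenizer code: the function name `split_periods`, the `split_sentence_final` flag, and the toy prefix set (1-99 plus "sec") are all assumptions made for illustration.

```python
def split_periods(sentence, nonbreaking, split_sentence_final=False):
    """Detach a trailing '.' from each whitespace-separated word,
    unless the part before the dot is a listed non-breaking prefix.

    Hypothetical helper, not the Moses implementation.
    """
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            protected = stem in nonbreaking
            # Proposed exception: a prefix at the very end of the
            # sentence is treated as an ordinary word.
            if is_last and split_sentence_final:
                protected = False
            if not protected:
                out.extend([stem, "."])
                continue
        out.append(word)
    return " ".join(out)


# Toy prefix list, assumed for this example only.
NONBREAKING = {str(n) for n in range(1, 100)} | {"sec"}

example = "1. one microsoft way from 9 to 1."
print(split_periods(example, NONBREAKING))
print(split_periods(example, NONBREAKING, split_sentence_final=True))
```

With the position-based exception, the list marker "1." at the start stays intact while the sentence-final "1." is split. The same exception would still split a sentence that really does end in an abbreviation.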
On Wed, Nov 7, 2018 at 1:46 AM Ozan Çağlayan <[email protected]> wrote:

> Yes, the rules come from the nonbreaking_prefixes files, which are
> text files listing the prefixes that, when followed by a <dot>, should
> not be tokenized. But I think this rule should not be applied if the
> prefix is actually a suffix of the sentence. Similar situations arise
> for French and other languages as well. For French, "sec." is a
> non-breaking prefix, the abbreviation for "seconds", but "sec" also
> means "dry". So if a sentence ends with the "dry" meaning of "sec.",
> the <dot> is also not tokenized.
>
> When the size of the corpora goes to infinity, this means that all
> nonbreaking_prefixes for a language will end up in the model
> vocabulary for NMT.
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

--
Regards,
Ergun
