Yes, the rules come from the nonbreaking_prefixes files, which are text
files listing the prefixes that, when followed by a <dot>, should not have
that <dot> split off by the tokenizer. But I think this rule should not be
applied when the prefix is actually the last word of the sentence. Similar
situations arise for French and other languages as well. In French, "sec."
is a non-breaking prefix because it abbreviates "seconds", but "sec" also
means "dry". So if a sentence ends with the "dry" meaning of "sec", the
final <dot> is not split off either.
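
To make the proposal concrete, here is a minimal sketch in Python of the
behaviour I have in mind. It is not the actual Moses tokenizer code: the
function name tokenize_periods and the one-entry prefix set are made up for
illustration, and it only looks at whitespace-separated words ending in a
single <dot>.

```python
def tokenize_periods(sentence, nonbreaking_prefixes):
    """Split a trailing period off each whitespace-separated word, except
    after a listed prefix that is NOT the last word of the sentence.

    Hypothetical helper for illustration; not the Moses implementation.
    """
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        is_last = i == len(words) - 1
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            if stem in nonbreaking_prefixes and not is_last:
                # Listed abbreviation in mid-sentence: keep the dot attached.
                out.append(word)
            else:
                # Ordinary word, or any word in sentence-final position:
                # split the dot off as its own token.
                out.extend([stem, "."])
        else:
            out.append(word)
    return out


# Illustrative subset only; the real French file lists many more entries.
FRENCH_PREFIXES = {"sec"}

print(tokenize_periods("Il a couru 10 sec. hier.", FRENCH_PREFIXES))
print(tokenize_periods("Le linge est sec.", FRENCH_PREFIXES))
```

With this rule, "sec." keeps its <dot> in the middle of the first sentence,
while the sentence-final "sec." in the second one is split into "sec" and
".". The trade-off is that a genuine abbreviation at the very end of a
sentence would also lose its <dot>.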

As the size of the corpus goes to infinity, this means that all
nonbreaking_prefixes for a language will end up in the NMT model
vocabulary with the <dot> attached (e.g. "sec." alongside "sec").
_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
