Hi, we only provide lists for the languages for which we created them. We would be happy to include other lists in the distribution if they were made available.
The lists ensure that the period after an abbreviation such as "Mr." is not split off as a separate token (a period is also never split off if the following word is lowercase). You can use the tokenizer for any other language, and it may not make much difference, since a phrase-based model will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
> Hi,
>
> I’ve got a question about the script tokenizer.perl.
> I’m wondering whether it is possible to get
> nonbreaking_prefix.* files for various languages somewhere. Does such a place exist?
> Or, how can I tokenize a text file if I don’t have enough knowledge
> about the particular language?
>
> Thanks, Tomas
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
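
To illustrate the period rule described in the reply, here is a minimal Python sketch. It is not the actual tokenizer.perl logic, only a simplified approximation of the two conditions mentioned: a trailing period stays attached if the word is in the non-breaking prefix list or if the following word starts with a lowercase letter. The function name split_periods and the tiny prefix set are hypothetical and exist only for this example.

```python
# Hypothetical, tiny prefix set for illustration only; the real lists
# are distributed as nonbreaking_prefix.* files, one per language.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof"}


def split_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split trailing periods off words, except after known prefixes
    or when the following word starts with a lowercase letter."""
    words = text.split()
    out = []
    for i, word in enumerate(words):
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            # Empty string when this is the last word of the input.
            nxt = words[i + 1] if i + 1 < len(words) else ""
            if stem in prefixes or (nxt and nxt[0].islower()):
                out.append(word)          # keep the period attached
            else:
                out.extend([stem, "."])   # split the period off
        else:
            out.append(word)
    return " ".join(out)


with_list = split_periods("Mr. Smith arrived. He left.")
without_list = split_periods("Mr. Smith arrived. He left.", prefixes=set())
```

With the prefix set, "Mr." stays intact; with an empty set (the situation for a language that has no list), it becomes "Mr ." which, as noted in the reply, a phrase-based model can still handle as a phrase.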
