I created nonbreaking_prefix files for ES, FR and IT based on some publicly available abbreviation lists. They are available here: http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/

I would take these with a grain of salt - they need to be reviewed by people familiar with the languages. The same location also contains a PT nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is accurate.
I also have a script that converts SRX files into nonbreaking_prefix files, with some manual editing required. Please let me know if you are interested.

Achim

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Philipp Koehn
Sent: Wednesday, September 15, 2010 11:17 AM
To: Tomas Hudik
Cc: [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Hi,

we only provide the lists for the languages we created. We would be happy to include other lists in the distribution if they were made available. They serve the purpose that periods after, for instance, "Mr." are not split off (no periods are split off if the following word is lowercase).

You can use the tokenizer for any other language, and it may not make much difference, since a phrase-based model will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
> Hi,
>
> I've got a question about the script tokenizer.perl.
> I'm wondering whether it is possible to get nonbreaking_prefix.*
> files for various languages somewhere. Does such a place exist?
> Or, how can I tokenize a text file if I don't have enough knowledge
> about the particular language?
>
> Thanks, Tomas
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
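
For readers unfamiliar with what the prefix lists do, the rule Philipp describes (a period stays attached after a listed prefix such as "Mr.", or when the following word is lowercase) can be sketched roughly as below. This is a simplified Python illustration, not the logic of tokenizer.perl itself: the function name `split_periods` and the tiny in-line prefix set are assumptions made for the example, and a real setup would read the prefixes from a nonbreaking_prefix.* file.

```python
# Hypothetical, deliberately tiny prefix set for illustration only.
# Real lists come from the nonbreaking_prefix.* files discussed in this thread.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof"}


def split_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split a trailing period off a word unless a nonbreaking rule applies.

    Simplified rules, following the description in the thread:
      1. Keep the period if the word (without it) is a listed prefix.
      2. Keep the period if the next word starts with a lowercase letter.
      3. Otherwise, emit the period as its own token.
    """
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            next_word = words[i + 1] if i + 1 < len(words) else None
            if stem in prefixes:
                tokens.append(word)  # rule 1: known prefix, e.g. "Mr."
            elif next_word is not None and next_word[0].islower():
                tokens.append(word)  # rule 2: next word is lowercase
            else:
                tokens.extend([stem, "."])  # rule 3: split the period off
        else:
            tokens.append(word)
    return " ".join(tokens)


print(split_periods("Mr. Smith arrived. He left."))
# Mr. Smith arrived . He left .
print(split_periods("Mr. Smith arrived.", prefixes=set()))
# Mr . Smith arrived .
```

The second call shows what happens without a prefix list for the language: "Mr." becomes "Mr .", which, as noted in the reply, a phrase-based model can still learn as a phrase.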
