I created nonbreaking_prefix files for ES, FR and IT based on some publicly
available abbreviation lists. They are available here:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/
I would take these with a grain of salt - they need to be reviewed by people
familiar with the languages. The same location also contains a PT
nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is
accurate.

I also have a script that converts SRX files into nonbreaking_prefix files
with some manual editing required. Please let me know if you are interested.
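
For illustration only, here is a minimal sketch of what such a conversion could look like. It is not the script mentioned above. It assumes SRX rules of the form <rule break="no"><beforebreak>...</beforebreak></rule>, and the function name srx_to_prefixes is hypothetical. Only patterns that are a plain abbreviation followed by an escaped period are extracted; anything more complex is skipped, which is where the manual editing would come in.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}rule' down to 'rule'."""
    return tag.rsplit("}", 1)[-1]


def srx_to_prefixes(srx_text):
    """Collect simple abbreviations from the no-break rules of an SRX document.

    Assumption: an abbreviation rule looks like \\bMr\\. or Mr\\. in the
    <beforebreak> element. Other regular expressions are ignored.
    """
    root = ET.fromstring(srx_text)
    prefixes = set()
    for element in root.iter():
        if local_name(element.tag) != "rule" or element.get("break") != "no":
            continue
        for child in element:
            if local_name(child.tag) != "beforebreak" or not child.text:
                continue
            # Optional literal "\b", then letters only, then a literal "\."
            match = re.fullmatch(r"(?:\\b)?([^\W\d_]+)\\\.", child.text.strip())
            if match:
                prefixes.add(match.group(1))
    return sorted(prefixes)


SAMPLE_SRX = r"""
<srx>
  <languagerule languagerulename="English">
    <rule break="no"><beforebreak>\bMr\.</beforebreak><afterbreak>\s</afterbreak></rule>
    <rule break="no"><beforebreak>\bDr\.</beforebreak><afterbreak>\s</afterbreak></rule>
    <rule break="no"><beforebreak>[Ee]tc\.</beforebreak><afterbreak>\s</afterbreak></rule>
    <rule break="yes"><beforebreak>[.?!]+</beforebreak><afterbreak>\s</afterbreak></rule>
  </languagerule>
</srx>
"""

# One prefix per line, without the trailing period.
print("\n".join(srx_to_prefixes(SAMPLE_SRX)))
```

The "[Ee]tc\." rule in the sample is dropped because it is not a plain literal; such entries would have to be expanded by hand.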

Achim

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Philipp Koehn
Sent: Wednesday, September 15, 2010 11:17 AM
To: Tomas Hudik
Cc: [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Hi,

we only provide the lists for the languages we created.
We would be happy to include other lists in the distribution,
if such were made available.

They serve the purpose that periods after, for instance,
"Mr." are not split off (no periods are split off if the following
word is lowercase).
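
As a rough illustration of these two rules, here is a simplified sketch. It is not the logic of tokenizer.perl itself: the function name split_periods and the whitespace-based splitting are assumptions made for the example, and it only looks at a single trailing period per word.

```python
def split_periods(text, prefixes):
    """Split a trailing period off each word unless one of two rules applies.

    The period stays attached if the word without its period is a known
    nonbreaking prefix (e.g. "Mr"), or if the following word starts with
    a lowercase letter. A period on the last word is always split off.
    """
    words = text.split()
    tokens = []
    for i, word in enumerate(words):
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            next_word = words[i + 1] if i + 1 < len(words) else ""
            # "".islower() is False, so the last word falls through to a split.
            if stem in prefixes or next_word[:1].islower():
                tokens.append(word)
            else:
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return " ".join(tokens)


print(split_periods("Mr. Smith arrived. He left.", {"Mr"}))
# Mr. Smith arrived . He left .
print(split_periods("Mr. Smith arrived. He left.", set()))
# Mr . Smith arrived . He left .
```

Without a prefix list, "Mr." becomes "Mr ." because the next word is capitalized, which is exactly the case the nonbreaking_prefix files are meant to cover.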

You can use the tokenizer for any other language, and
it may not make much difference, since a phrase-based model
will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
> Hi,
>
> I’ve got a question about the tokenizer.perl script.
> I’m wondering whether it is possible to get nonbreaking_prefix.* files
> for various languages somewhere. Does such a place exist?
> Or, how can I tokenize a text file if I don’t have enough knowledge
> about the particular language?
>
> Thanks, Tomas
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>


