Re: [Moses-support] tokenizer for different languages

Philipp Koehn Wed, 15 Sep 2010 08:18:00 -0700

Hi,

We only provide nonbreaking prefix lists for the languages
we created them for. We would be happy to include lists for
other languages in the distribution, if such were made available.

Their purpose is to ensure that the period in, for instance,
"Mr." is not split off (a period is also never split off if the
following word is lowercase).
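
As a rough illustration of the rule described above, here is a minimal
Python sketch. It is not the code in tokenizer.perl, only a simplified
approximation of the two exceptions mentioned; the function name
split_periods and the four-entry prefix set are hypothetical and stand
in for a real nonbreaking_prefix.* list.

```python
# Hypothetical, tiny prefix list for illustration only; the real
# nonbreaking_prefix.* files contain many more entries.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof"}


def split_periods(text, prefixes=NONBREAKING_PREFIXES):
    """Split word-final periods into separate tokens, with two exceptions."""
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Leave words without a trailing period (and a bare ".") alone.
        if not word.endswith(".") or len(word) == 1:
            out.append(word)
            continue
        stem = word[:-1]
        next_word = words[i + 1] if i + 1 < len(words) else ""
        if stem in prefixes:
            # Exception 1: a listed abbreviation such as "Mr." stays intact.
            out.append(word)
        elif next_word[:1].islower():
            # Exception 2: the following word is lowercase, so this is
            # probably not a sentence boundary.
            out.append(word)
        else:
            out.extend([stem, "."])
    return " ".join(out)


if __name__ == "__main__":
    sentence = "Mr. Smith arrived. He left."
    print(split_periods(sentence))
    # With an empty list, as for a language that has no prefix file:
    print(split_periods(sentence, prefixes=set()))
```

With the prefix list, "Mr." is kept together; with an empty list it
becomes "Mr ." while the rest of the output is unchanged.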

You can use the tokenizer for any other language, and the
missing list may not make much difference, since a phrase-based
model will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
> Hi,
>
> I’ve got a question about the script tokenizer.perl.
> I’m wondering whether it is possible to get
> nonbreaking_prefix.* files for various languages somewhere. Does such
> a place exist? Or, how can I tokenize a text file if I don’t have
> enough knowledge about the particular language?
>
> Thanks, Tomas
>
> _______________________________________________
> Moses-support mailing list
> [email protected]
> http://mailman.mit.edu/mailman/listinfo/moses-support
>

