Hi Tomas,
I have attached the srx2nbr.pl script; it is licensed under the Apache License 2.0.
It is still very rough, and the resulting files need manual editing, which is why
I haven't yet added it to the Moses for Localization project
(http://code.google.com/p/m4loc/). languagetool.org is a good source for SRX
files licensed under the LGPL (I believe they have Polish).
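
To illustrate the kind of conversion involved, here is a minimal Python sketch.
It is not the attached srx2nbr.pl, only an illustration under stated
assumptions: it assumes SRX 2.0 rules with break="no" whose <beforebreak>
pattern is a plain literal abbreviation such as \bprof\., and it emits one
prefix per line without the trailing period, as the Moses nonbreaking_prefix
files do. The function name srx_to_prefixes, the SIMPLE pattern and the sample
rules are hypothetical. Any more complex pattern is reported for manual editing
rather than guessed at.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, trimmed sample; real SRX files are larger and use richer regexes.
SAMPLE_SRX = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
  <body>
    <languagerules>
      <languagerule languagerulename="Polish">
        <rule break="no">
          <beforebreak>\bprof\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>\bm\.in\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="no">
          <beforebreak>[A-Z]\.</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
        <rule break="yes">
          <beforebreak>[.?!]+</beforebreak>
          <afterbreak>\s</afterbreak>
        </rule>
      </languagerule>
    </languagerules>
  </body>
</srx>"""

# Assumption: a "simple" rule is an optional \b, then word characters
# (optionally with escaped inner dots), then a final escaped dot.
SIMPLE = re.compile(r"^(?:\\b)?((?:\w+\\\.)*\w+)\\\.$")


def local_name(tag):
    """Strip an XML namespace such as '{http://www.lisa.org/srx20}rule'."""
    return tag.rsplit("}", 1)[-1]


def srx_to_prefixes(srx_text, language):
    """Return (prefixes, skipped) for the no-break rules of one language.

    prefixes: literal abbreviations without the trailing period.
    skipped:  beforebreak patterns too complex to convert automatically.
    """
    root = ET.fromstring(srx_text)
    prefixes, skipped = [], []
    for lang_rule in root.iter():
        if local_name(lang_rule.tag) != "languagerule":
            continue
        if lang_rule.get("languagerulename") != language:
            continue
        for rule in lang_rule:
            if local_name(rule.tag) != "rule" or rule.get("break") != "no":
                continue
            before = ""
            for child in rule:
                if local_name(child.tag) == "beforebreak":
                    before = (child.text or "").strip()
            match = SIMPLE.match(before)
            if match:
                # Unescape inner dots: m\.in -> m.in
                prefixes.append(match.group(1).replace("\\.", "."))
            else:
                skipped.append(before)
    return prefixes, skipped


if __name__ == "__main__":
    prefixes, skipped = srx_to_prefixes(SAMPLE_SRX, "Polish")
    print("\n".join(prefixes))
    for pattern in skipped:
        print("# needs manual review:", pattern)
```

Only literal abbreviations are converted in this sketch; character classes and
alternations end up in the "needs manual review" list, which is where the
manual editing would come in.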

For Japanese you need a word segmenter such as ChaSen or KyTea
(http://www.phontron.com/kytea/).

Cheers
Achim

-----Original Message-----
From: Tomas Hudik [mailto:[email protected]] 
Sent: Wednesday, September 15, 2010 12:51 PM
To: Achim Ruopp
Cc: Philipp Koehn; [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Philipp and Achim - thanks a lot.

I'm mainly interested in Japanese and Polish. Do you have any idea
where I can get the files for these languages?
And yes, I'm interested in your SRX script. Is it under a GNU license?  I
couldn't find it at:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence
Where is it located?

Thanks once more, Tomas


On Wed, Sep 15, 2010 at 5:59 PM, Achim Ruopp <[email protected]> wrote:
> I created nonbreaking_prefix files for ES, FR and IT based on some publicly
> available abbreviation lists. They are available here:
> http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/
> I would take these with a grain of salt - they need to be reviewed by people
> familiar with the languages. The same location also contains a PT
> nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is
> accurate.
>
> I also have a script that converts SRX files into nonbreaking_prefix files,
> though the output requires some manual editing. Please let me know if you
> are interested.
>
> Achim
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Philipp Koehn
> Sent: Wednesday, September 15, 2010 11:17 AM
> To: Tomas Hudik
> Cc: [email protected]
> Subject: Re: [Moses-support] tokenizer for different languages
>
> Hi,
>
> we only provide the lists for the languages we created.
> We would be happy to include other lists in the distribution,
> if such were made available.
>
> Their purpose is to keep the period attached to abbreviations such as
> "Mr." instead of splitting it off (a period is never split off if the
> following word is lowercase).
>
> You can use the tokenizer for any other language, and
> it may not make much difference, since a phrase-based model
> will happily translate, say, "Mr ." as a phrase.
>
> -phi
>
> On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
>> Hi,
>>
>> I’ve got a question about the tokenizer.perl script.
>> I’m wondering whether it is possible to get nonbreaking_prefix.* files
>> for various languages somewhere. Does such a place exist?
>> Or, how can I tokenize a text file if I don’t have enough knowledge
>> about the particular language?
>>
>> Thanks, Tomas
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
>>
>

Attachment: srx2nbr.pl
Description: Binary data
