Hi Tomas,

I attached the srx2nbr.pl script - it is licensed under the Apache License 2.0. It is still very rough and the resulting files need manual editing, which is why I haven't added it yet to the Moses for Localization project (http://code.google.com/p/m4loc/).

languagetool.org is a good source for SRX files licensed under the LGPL (I believe they have Polish).
For Japanese you need a word segmenter like Chasen or KyTea (http://www.phontron.com/kytea/).

Cheers
Achim

-----Original Message-----
From: Tomas Hudik [mailto:[email protected]]
Sent: Wednesday, September 15, 2010 12:51 PM
To: Achim Ruopp
Cc: Philipp Koehn; [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Philipp and Achim - thanks a lot.
I'm mainly interested in Japanese and Polish. Do you have an idea where I can get the files for these languages?
And yes - I'm interested in your SRX script - is it under a GNU license? I couldn't find it at:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence
Where is it located?

once more - thanks, Tomas

On Wed, Sep 15, 2010 at 5:59 PM, Achim Ruopp <[email protected]> wrote:
> I created nonbreaking_prefix files for ES, FR and IT based on some publicly
> available abbreviation lists. They are available here:
> http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/
> I would take these with a grain of salt - they need to be reviewed by people
> familiar with the languages. The same location also contains a PT
> nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is
> accurate.
>
> I also have a script that converts SRX files into nonbreaking_prefix files
> with some manual editing required. Please let me know if you are interested.
>
> Achim
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Philipp Koehn
> Sent: Wednesday, September 15, 2010 11:17 AM
> To: Tomas Hudik
> Cc: [email protected]
> Subject: Re: [Moses-support] tokenizer for different languages
>
> Hi,
>
> we only provide the lists for the languages we created.
> We would be happy to include other lists in the distribution,
> if such were made available.
>
> They serve the purpose that periods after, for instance,
> "Mr." are not split off (no periods are split off if the following
> word is lowercase).
>
> You can use the tokenizer for any other language, and
> it may not make much difference, since a phrase-based model
> will happily translate, say, "Mr ." as a phrase.
>
> -phi
>
> On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:
>> Hi,
>>
>> I’ve got a question on the script tokenizer.perl.
>> I’m wondering whether it is possible to get nonbreaking_prefix.*
>> files for various languages somewhere. Does such a place exist?
>> Or, how can I tokenize a text file if I don’t have enough knowledge
>> about the particular language?
>>
>> Thanks, Tomas
>>
>> _______________________________________________
>> Moses-support mailing list
>> [email protected]
>> http://mailman.mit.edu/mailman/listinfo/moses-support
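
To make the rule from Philipp's quoted reply concrete, here is a minimal Python sketch. It is only an illustration of the two conditions he mentions (a listed prefix such as "Mr", or a lowercase following word), not the actual tokenizer.perl logic, which handles more cases. The function name `split_final_periods` and the sample prefix set are hypothetical.

```python
def split_final_periods(text, nonbreaking_prefixes):
    """Split a trailing period off each whitespace-separated token,
    unless one of the two rules from the thread protects it.

    Assumption: this is a simplification for illustration only.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Only consider tokens that end in exactly one period.
        if len(word) > 1 and word.endswith(".") and not word.endswith(".."):
            stem = word[:-1]
            next_word = words[i + 1] if i + 1 < len(words) else ""
            protected = (
                stem in nonbreaking_prefixes      # e.g. "Mr" from a prefix list
                or next_word[:1].islower()        # following word is lowercase
            )
            out.append(word if protected else stem + " .")
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    prefixes = {"Mr"}  # hypothetical stand-in for a nonbreaking_prefix file
    print(split_final_periods("Mr. Smith arrived.", prefixes))
    print(split_final_periods("Mr. Smith arrived.", set()))
```

With the prefix list, "Mr." stays attached; without it, the period is split off as "Mr .", which is the case Philipp says a phrase-based model can still handle.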
srx2nbr.pl
Description: Binary data
