Hi, I merged the list of non-breaking prefixes for Spanish sent by Achim with the one I'm using (which is based on FreeLing). Please find it attached, along with a list of prefixes for Catalan.
Philipp, feel free to commit them.

jesus

On 15/09/10 18:06, Philipp Koehn wrote:
Hi,

thanks - I committed them to SVN.

-phi

On Wed, Sep 15, 2010 at 4:59 PM, Achim Ruopp <[email protected]> wrote:

I created nonbreaking_prefix files for ES, FR and IT based on some publicly available abbreviation lists. They are available here:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/

I would take these with a grain of salt - they need to be reviewed by people familiar with the languages.

The same location also contains a PT nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is accurate.

I also have a script that converts SRX files into nonbreaking_prefix files, with some manual editing required. Please let me know if you are interested.

Achim

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Philipp Koehn
Sent: Wednesday, September 15, 2010 11:17 AM
To: Tomas Hudik
Cc: [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Hi,

we only provide the lists for the languages we created. We would be happy to include other lists in the distribution, if such were made available.

They serve the purpose that periods after, for instance, "Mr." are not split off (no periods are split off if the following word is lowercase).

You can use the tokenizer for any other language, and it may not make much difference, since a phrase-based model will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:

Hi,

I’ve got a question about the script tokenizer.perl. I’m wondering whether it is possible to get nonbreaking_prefix.* files for various languages somewhere. Does such a place exist? Or, how can I tokenize a text file if I don’t have enough knowledge about the particular language?
Thanks,
Tomas

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support
nonbreaking_prefix.es
Description: application/ecmascript
#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.

#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

#Abbreviations
aa abrev adj adm admón afma afmas afmo afmos ag am ap apdo art arts assn atte av
bros bv
cap caps cg cgo cia cía cit cl cm co col corp cos cta cte ctra cts
dcha dept dg dl dm doc docs dpt dpto dr dra dras dres dto dupdo
ed ej emma emmas emmo emmos entlo entpo esp etc ex excm excma excmas excmo excmos
fasc fdo fig figs fol fra
gral
ha hnos hz
ib ibid ibíd id íd ilm ilma ilmas ilmo ilmos iltre inc intr ít izq izqda izqdo
jr
kc kcal kg khz kl km kw
lám lda ldo lib lim ltd
ma máx mg mhz min mín mm mr mrs mtro
ntra ntro núm
ob op
pág págs pd ph pje pl plc pm pp pral prof pról prov ps pta ptas pte pts pza
ref rr rte
sec seg sig sr sra sras sres srta ss sust
tech tel teléf tít
ud uds
vda vdo vid vol vols vra vro vta
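For readers who want to see what the header comment of the list means in practice, here is a minimal Python sketch of the rule described in this thread: a trailing period stays attached when the word is in the prefix list or when the next word starts with a lowercase letter. It is not the logic of tokenizer.perl, which handles more cases. The function names `load_prefixes` and `split_final_periods` are made up for this illustration, the file is assumed to hold one entry per line with `#` comment lines, and the case-insensitive lookup (so that "Sr." matches the entry `sr`) is an assumption of the sketch.

```python
def load_prefixes(text):
    """Collect entries from prefix-file content (assumed: one entry per line, '#' = comment)."""
    prefixes = set()
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            prefixes.add(entry)
    return prefixes


def split_final_periods(sentence, prefixes):
    """Split a trailing period off each word unless the simplified rule protects it."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            next_word = words[i + 1] if i + 1 < len(words) else ""
            # Assumption: entries are matched as written or in lowercase form.
            known = stem in prefixes or stem.lower() in prefixes
            # An empty next word (end of input) counts as "not lowercase".
            next_is_lower = next_word[:1].islower()
            if known or next_is_lower:
                out.append(word)          # keep "Sr." or "aprox. diez" intact
            else:
                out.extend([stem, "."])   # treat the period as a separate token
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    sample = "#Abbreviations\nsr\nsra\ndr\netc\n"
    prefixes = load_prefixes(sample)
    print(split_final_periods("El Sr. García llegó a Madrid.", prefixes))
    # prints: El Sr. García llegó a Madrid .
```

Without the `sr` entry the same sentence would come out as "El Sr . García llegó a Madrid .", which is the case the prefix list is meant to prevent.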
