Re: [Moses-support] tokenizer for different languages

Jesús Giménez Mon, 20 Sep 2010 01:41:25 -0700

hi,

I merged the list of non-breaking prefixes for Spanish sent by Achim with the one I'm using (which is based on FreeLing). Please find it attached, along with a list of prefixes for Catalan.


  Philipp, feel free to commit them,

  jesus


On 15/09/10 18:06, Philipp Koehn wrote:

Hi,

thanks - I committed them to SVN.

-phi

On Wed, Sep 15, 2010 at 4:59 PM, Achim Ruopp <[email protected]> wrote:

I created nonbreaking_prefix files for ES, FR and IT based on some publicly
available abbreviation lists. They are available here:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/
I would take these with a grain of salt - they need to be reviewed by people
familiar with the languages. The same location also contains a PT
nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is
accurate.

I also have a script that converts SRX files into nonbreaking_prefix files
with some manual editing required. Please let me know if you are interested.

Achim

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Philipp Koehn
Sent: Wednesday, September 15, 2010 11:17 AM
To: Tomas Hudik
Cc: [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Hi,

we only provide the lists for the languages we created.
We would be happy to include other lists in the distribution,
if such were made available.

They serve the purpose that periods after, for instance,
"Mr." are not split off (no periods are split off if the following
word is lowercase).
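
The two rules described above can be illustrated with a short Python sketch. This is not the tokenizer.perl code: the function name split_periods and the whitespace-based word splitting are assumptions made for illustration, and the real script handles many more cases.

```python
def split_periods(text, prefixes):
    """Split a trailing period off a word unless one of two rules keeps it.

    Hypothetical helper, not part of Moses. `prefixes` is a set of
    non-breaking prefixes written without the period, e.g. {"Mr"}.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Only words ending in a period are of interest here.
        if not word.endswith(".") or len(word) == 1:
            out.append(word)
            continue
        stem = word[:-1]
        nxt = words[i + 1] if i + 1 < len(words) else ""
        if stem in prefixes:
            # Rule 1: a listed prefix keeps its period attached.
            out.append(word)
        elif nxt[:1].islower():
            # Rule 2: a lowercase next word means no split.
            out.append(word)
        else:
            out.extend([stem, "."])
    return " ".join(out)


print(split_periods("Mr. Smith arrived.", {"Mr"}))
print(split_periods("Mr. Smith arrived.", set()))
```

With the prefix listed, the first call keeps "Mr." intact and only splits the sentence-final period. Without it, the second call also produces "Mr .", which is the case the prefix lists exist to prevent.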

You can use the tokenizer for any other language, and
it may not make much difference, since a phrase-based model
will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik <[email protected]> wrote:

Hi,

I’ve got a question about the tokenizer.perl script.
I’m wondering whether it is possible to get nonbreaking_prefix.*
files for various languages somewhere. Does such a place exist?
Or, how can I tokenize a text file if I don’t have enough knowledge
about the particular language?

Thanks, Tomas

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support

nonbreaking_prefix.es
Description: application/ecmascript

#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.

#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z

#Abbreviations
aa
abrev
adj
adm
admón
afma
afmas
afmo
afmos
ag
am
ap
apdo
art
arts
assn
atte
av
bros
bv
cap
caps
cg
cgo
cia
cía
cit
cl
cm
co
col
corp
cos
cta
cte
ctra
cts
dcha
dept
dg
dl
dm
doc
docs
dpt
dpto
dr
dra
dras
dres
dto
dupdo
ed
ej
emma
emmas
emmo
emmos
entlo
entpo
esp
etc
ex
excm
excma
excmas
excmo
excmos
fasc
fdo
fig
figs
fol
fra
gral
ha
hnos
hz
ib
ibid
ibíd
id
íd
ilm
ilma
ilmas
ilmo
ilmos
iltre
inc
intr
ít
izq
izqda
izqdo
jr
kc
kcal
kg
khz
kl
km
kw
lám
lda
ldo
lib
lim
ltd
ma
máx
mg
mhz
min
mín
mm
mr
mrs
mtro
ntra
ntro
núm
ob
op
pág
págs
pd
ph
pje
pl
plc
pm
pp
pral
prof
pról
prov
ps
pta
ptas
pte
pts
pza
ref
rr
rte
sec
seg
sig
sr
sra
sras
sres
srta
ss
sust
tech
tel
teléf
tít
ud
uds
vda
vdo
vid
vol
vols
vra
vro
vta
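
For readers who want to use a list like the one above outside tokenizer.perl, here is a minimal Python sketch of reading it. It assumes only what the attachment itself shows: one prefix per line, with lines starting with # treated as comments. The function name load_prefixes is hypothetical.

```python
def load_prefixes(lines):
    """Collect prefixes from the lines of a nonbreaking_prefix file.

    Hypothetical helper: blank lines and lines starting with '#'
    are skipped, and every other line is taken as one prefix.
    """
    prefixes = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        prefixes.add(entry)
    return prefixes


# A few lines in the same layout as the attachment.
sample = """#Abbreviations
dr
dra

etc
""".splitlines()

print(sorted(load_prefixes(sample)))
```

Under this reading, a comment that wraps onto a second line without a leading # would be picked up as a prefix, so each comment should stay on a single line.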
