hi,

I merged the list of non-breaking prefixes for Spanish sent by Achim with the one I'm using (which is based on FreeLing). Please find it attached, along with a list of prefixes for Catalan.

  Philipp, feel free to commit them,

  jesus


On 15/09/10 18:06, Philipp Koehn wrote:
Hi,

thanks - I committed them to SVN.

-phi

On Wed, Sep 15, 2010 at 4:59 PM, Achim Ruopp<[email protected]>  wrote:
I created nonbreaking_prefix files for ES, FR and IT based on some publicly
available abbreviation lists. They are available here:
http://code.google.com/p/corpus-tools/source/browse/trunk/Lingua-Sentence/share/
I would take these with a grain of salt - they need to be reviewed by people
familiar with the languages. The same location also contains a PT
nonbreaking_prefix file authored by Hilário Leal Fontes, which I believe is
accurate.

I also have a script that converts SRX files into nonbreaking_prefix files
with some manual editing required. Please let me know if you are interested.

Achim

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Philipp Koehn
Sent: Wednesday, September 15, 2010 11:17 AM
To: Tomas Hudik
Cc: [email protected]
Subject: Re: [Moses-support] tokenizer for different languages

Hi,

we only provide the lists for the languages we created.
We would be happy to include other lists in the distribution,
if such were made available.

Their purpose is to keep the period attached after an abbreviation
such as "Mr." instead of splitting it off. (Periods are never split
off when the following word is lowercase.)
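
To illustrate the rule just described, here is a minimal Python sketch. It is not the code in tokenizer.perl, only an approximation of the behaviour described in this message, and the function name split_periods is made up for the example.

```python
def split_periods(text, nonbreaking_prefixes):
    """Split a trailing period off each word unless the rule keeps it attached.

    Hypothetical helper: approximates the behaviour described in the thread,
    not the actual logic of tokenizer.perl.
    """
    words = text.split()
    out = []
    for i, word in enumerate(words):
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            next_word = words[i + 1] if i + 1 < len(words) else ""
            if stem in nonbreaking_prefixes:
                # Listed abbreviation: the period stays attached.
                out.append(word)
            elif next_word[:1].islower():
                # Next word is lowercase: the period is never split off.
                out.append(word)
            else:
                # Otherwise treat the period as a separate token.
                out.append(stem + " .")
        else:
            out.append(word)
    return " ".join(out)


with_list = split_periods("Mr. Smith arrived.", {"Mr"})
without_list = split_periods("Mr. Smith arrived.", set())
```

With "Mr" in the list the first period stays attached; with an empty list it is split off, because "Smith" starts with an upper-case letter.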

You can use the tokenizer for any other language, and the
missing list may not make much difference, since a phrase-based
model will happily translate, say, "Mr ." as a phrase.

-phi

On Wed, Sep 15, 2010 at 2:20 PM, Tomas Hudik<[email protected]>  wrote:
Hi,

I’ve got a question about the tokenizer.perl script.
I’m wondering whether it is possible to get nonbreaking_prefix.*
files for various languages somewhere. Does such a place exist?
Or, how can I tokenize a text file if I don’t have enough knowledge
about the particular language?

Thanks, Tomas

_______________________________________________
Moses-support mailing list
[email protected]
http://mailman.mit.edu/mailman/listinfo/moses-support


Attachment: nonbreaking_prefix.es
Description: application/ecmascript

#Anything in this file, followed by a period (and an upper-case word), does NOT indicate an end-of-sentence marker.
#Special cases are included for prefixes that ONLY appear before 0-9 numbers.

#any single upper case letter followed by a period is not a sentence ender (excluding I occasionally, but we leave it in)
#usually upper case letters are initials in a name
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z

#Abbreviations
aa
abrev
adj
adm
admón
afma
afmas
afmo
afmos
ag
am
ap
apdo
art
arts
assn
atte
av
bros
bv
cap
caps
cg
cgo
cia
cía
cit
cl
cm
co
col
corp
cos
cta
cte
ctra
cts
dcha
dept
dg
dl
dm
doc
docs
dpt
dpto
dr
dra
dras
dres
dto
dupdo
ed
ej
emma
emmas
emmo
emmos
entlo
entpo
esp
etc
ex
excm
excma
excmas
excmo
excmos
fasc
fdo
fig
figs
fol
fra
gral
ha
hnos
hz
ib
ibid
ibíd
id
íd
ilm
ilma
ilmas
ilmo
ilmos
iltre
inc
intr
ít
izq
izqda
izqdo
jr
kc
kcal
kg
khz
kl
km
kw
lám
lda
ldo
lib
lim
ltd
ma
máx
mg
mhz
min
mín
mm
mr
mrs
mtro
ntra
ntro
núm
ob
op
pág
págs
pd
ph
pje
pl
plc
pm
pp
pral
prof
pról
prov
ps
pta
ptas
pte
pts
pza
ref
rr
rte
sec
seg
sig
sr
sra
sras
sres
srta
ss
sust
tech
tel
teléf
tít
ud
uds
vda
vdo
vid
vol
vols
vra
vro
vta
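
For readers who want to load a file in this format outside the tokenizer, the following Python sketch reads one prefix per line and skips blank lines and lines starting with "#". The function name parse_prefixes is hypothetical. The "#NUMERIC_ONLY#" marker for prefixes that only apply before numbers is an assumption about how such special cases are flagged; the list above contains no such entries.

```python
def parse_prefixes(lines):
    """Return (plain, numeric_only) sets of prefixes from an iterable of lines.

    Hypothetical helper. The "#NUMERIC_ONLY#" suffix is an assumed convention
    for prefixes that are non-breaking only before a number.
    """
    marker = "#NUMERIC_ONLY#"
    plain, numeric_only = set(), set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # Blank lines and comment lines carry no prefix.
            continue
        if line.endswith(marker):
            numeric_only.add(line[: -len(marker)].strip())
        else:
            plain.add(line)
    return plain, numeric_only


sample = ["#Abbreviations", "", "dr", "sra ", "No #NUMERIC_ONLY#"]
plain, numeric_only = parse_prefixes(sample)
```

For a file on disk, passing an open file handle (for example one opened with encoding="utf-8", since entries such as "núm" are non-ASCII) works the same way, because the function only iterates over lines.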