Hi, I came up with the following procedure:
import re

ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

def test_case(w):
    w_out = ''
    if w.isalpha():  # if there is a comma attached, it does not get in here
        if w.isupper():
            w_out = w.lower() + ALLCAPS
            return w_out
        elif w.islower():
            w_out = w + NOCAPS
            return w_out
        else:
            m = re.match("^[A-Z]", w)
            if m:
                w_out = w.lower() + CAPS  # not sure about this...
                return w_out
            else:
                w_out = w.lower() + MIDCAPS
                return w_out
    elif w.isdigit():
        w_out = w + DIGIT
        return w_out
It is called here:
#=========================
lines = 0
for s in file:
    lines += 1
    if lines % 1000 == 0:
        print('%d lines' % lines)
    #s = s.replace(",", "")
    sent = s.split()  # split the string on whitespace
    for w in sent:
        wout = test_case(w)
#==========================
But I don't know whether what I'm doing is sensible. Moreover:
- test_case has a problem: whenever it finds a punctuation character
attached to a word, it doesn't tag that word. I was thinking of
cleaning the punctuation out of the line before calling split() on it
(see the commented-out line), but I don't know whether I have to call
replace() once for every punctuation character?
- Is there a way to reprint the tagged text to a file, including the punctuation?
- Is my test_case a good start? Would you use regular expressions?
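
To make the punctuation questions above concrete, here is a rough sketch of one possible direction, not a definitive solution. The names strip_punctuation, tag_word and tag_line are hypothetical. It assumes that a "word" is any run of \w characters, that tokens which are neither purely alphabetic nor purely digits should be left untagged, and that the tag strings are the same as in test_case. The first function removes all ASCII punctuation with one re.sub() call instead of one replace() per character; the last one tags words in place, so punctuation and spacing are kept when the line is written back out.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_punctuation(line):
    """Remove every ASCII punctuation character in a single pass."""
    return PUNCT_RE.sub("", line)

def tag_word(w):
    """Apply the same rules as test_case; return other tokens unchanged."""
    if w.isalpha():
        if w.isupper():
            return w.lower() + "|ALLCAPS"
        if w.islower():
            return w + "|NOCAPS"
        if re.match("[A-Z]", w):
            return w.lower() + "|CAPS"
        return w.lower() + "|MIDCAPS"
    if w.isdigit():
        return w + "|DIGIT"
    return w  # assumption: mixed tokens such as "abc123" stay untagged

def tag_line(line):
    """Tag each word in place, leaving punctuation and spacing untouched."""
    return re.sub(r"\w+", lambda m: tag_word(m.group(0)), line)

print(tag_line("Hello, WORLD! iPod costs 99 euros."))
# hello|CAPS, world|ALLCAPS! ipod|MIDCAPS costs|NOCAPS 99|DIGIT euros|NOCAPS.
```

With something like this, each input line could be passed through tag_line() and written straight to the output file, so no separate step is needed to put the punctuation back.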
Thanks very much!
F.