Hi, I came up with the following procedure:

import re

ALLCAPS = "|ALLCAPS"
NOCAPS = "|NOCAPS"
MIDCAPS = "|MIDCAPS"
CAPS = "|CAPS"
DIGIT = "|DIGIT"

def test_case(w):
    # Lowercase the word and append a tag describing its original casing.
    if w.isalpha():  # a word with a comma attached never gets in here
        if w.isupper():
            return w.lower() + ALLCAPS
        elif w.islower():
            return w + NOCAPS
        elif re.match("^[A-Z]", w):
            return w.lower() + CAPS  # not sure about this...
        else:
            return w.lower() + MIDCAPS
    elif w.isdigit():
        return w + DIGIT
    # anything else (e.g. "word," or "abc123") falls through and returns None

It is called from here:
#=========================
    lines = 0
    for s in file:
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)
        #s = s.replace(",","")
        sent = s.split() #split string by spaces
        for w in sent:
            wout = test_case(w)
#==========================

But I don't know whether what I'm doing is sensible. Moreover:

- test_case has a problem: whenever it finds a punctuation character attached to a word, it doesn't tag the word. I was thinking of stripping the punctuation from the line before calling split() on it (see the commented-out line), but I don't know whether I have to call replace() once for every punctuation character.
- Is there a way to write the tagged text back to a file, punctuation included?
- Is my test_case a good start? Would you use regular expressions?
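
To make the first two points concrete, this is roughly what I imagine a regex-based version could look like. It is only a sketch under my own assumptions, not something I know to be the right approach: the names TOKEN_RE, tag_token, tag_line and tag_stream are made up for illustration, and I assume that punctuation (and mixed tokens such as "abc123") should simply pass through untagged so that they reappear in the output.

```python
import re

# Hypothetical tokenizer: a run of word characters, or one single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tag_token(tok):
    """Tag one token with the same tag strings as test_case."""
    if tok.isalpha():
        if tok.isupper():
            return tok.lower() + "|ALLCAPS"
        if tok.islower():
            return tok + "|NOCAPS"
        if tok[0].isupper():
            return tok.lower() + "|CAPS"
        return tok.lower() + "|MIDCAPS"
    if tok.isdigit():
        return tok + "|DIGIT"
    # Assumption: punctuation and mixed tokens are kept, but not tagged.
    return tok

def tag_line(line):
    """Split a line into words and punctuation, tag each, rejoin with spaces."""
    return " ".join(tag_token(t) for t in TOKEN_RE.findall(line))

def tag_stream(lines, out):
    """Write one tagged line to `out` for every input line."""
    for line in lines:
        out.write(tag_line(line) + "\n")

print(tag_line("Hello, WORLD 42 iPod!"))
# hello|CAPS , world|ALLCAPS 42|DIGIT ipod|MIDCAPS !
```

With this, punctuation would no longer need one replace() call per character, since the pattern separates it from the words in a single pass. Passing an open output file as `out` to tag_stream would write the tagged text, punctuation included, though the original spacing around punctuation is not preserved.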

Thanks very much!
F.
--
http://mail.python.org/mailman/listinfo/python-list
