On Sun, 3 May 2009 21:59:23 -0400,
Dan Liang <[email protected]> wrote:

> Hi tutors,
> 
> I am working on a file and need to replace each occurrence of a certain
> label (part of speech tag in this case) by a number of sub-labels. The file
> has the following format:
> 
> word1  \t    Tag1
> word2  \t    Tag2
> word3  \t    Tag3
> 
> Now the tags are complex and I wanted to split them in a tab-delimited
> fashion to have this:
> 
> word1   \t   Tag1Part1   \t   Tag2Part2   \t   Tag3Part3
> 
> I searched online for some solution and found the code below which uses a
> dictionary to store the tags that I want to replace in keys and the sub-tags
> as values. The problem with this is that it sometimes replaces tags that are
> not surrounded by spaces, which I do not like to happen*1*. Also, I wanted
> each new sub-tag to be followed by a tab, so that the new items that I end
> up having in my file are tab-delimited. For this, I put tabs between the
> items of each key in the dictionary*2*. I started thinking that this will
> not be the best solution of the problem and perhaps a script that uses
> regular expressions would be better*3*. Since I am new to Python, I thought
> I should ask you for your thoughts for a best solution. The items I want to
> replace are about 150 and I did not know how to iterate over them with
> regular expressions.

*3* I don't think regular expressions are the right tool here. Partly because 
you are new to Python and regexes get hairy quickly. But above all because 
they help with parsing, not rewriting. Here the input is very simple to parse, 
while the real work lies in the replacement function.

*1* If the source really looks like the above, then as I understand it, "tags 
that are not surrounded by spaces" can only occur inside words (e.g. the word 
'noun'). One more reason not to use regexes. You just need to read each line, 
keep the left part unchanged, and deal with the tag. The issue with your code 
is that it replaces tags "blindly", without taking into account the simple 
structure of the source, which would help you.
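
To illustrate the difference, here is a small sketch. The sample line 
("adjust" tagged as "adj") and the one-entry mapping are made up for the 
example; they are not taken from your file.

```python
# Hypothetical sample line: the word itself happens to contain the tag "adj".
line = "adjust\tadj"

# Assumed mapping for illustration only (one tag, three sub-tags).
sub_tags = {"adj": ["adj", "null", "null"]}

# Blind replacement rewrites every occurrence, including the one inside the word.
blind = line.replace("adj", "\t".join(sub_tags["adj"]))

# Structure-aware handling: split on the tab and touch only the tag field.
word, tag = line.split("\t")
careful = "\t".join([word] + sub_tags[tag])

print(repr(blind))    # the word has been damaged
print(repr(careful))  # the word is intact, only the tag was expanded
```

Splitting first means the word column is never looked at by the replacement 
step, so no amount of overlap between words and tag names can hurt.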

*2* I would rather have a dict whose values are lists of (sub)tags. Then let a 
replacement function deal with output formatting.
word_dic = {
'abbrev': ['abbrev', 'null', 'null'],
'adj': ['adj', 'null', 'null'],
'adv': ['adv', 'null', 'null'],
...
}
It's not only cleaner, it lets you modify formatting at will. The dict is only 
constant *data*. Separating data from process is good practice.
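
As a rough sketch of what that separation buys you: the helper name 
format_tag and its sep parameter below are my own invention for the example, 
and the two dict entries are copied from your word_dic.

```python
# Data only: each tag maps to a list of sub-tags, with no formatting baked in.
word_dic = {
    'adj': ['adj', 'null', 'null'],
    'case_def_acc': ['case_def', 'acc', 'null'],
}

def format_tag(tag, sep='\t'):
    """Join the sub-tags of `tag` with `sep` (hypothetical helper)."""
    return sep.join(word_dic[tag])

# Same data, two different output formats.
print(repr(format_tag('case_def_acc')))
print(repr(format_tag('case_def_acc', sep=', ')))
```

If you later want commas or spaces instead of tabs, only the separator 
changes; the 150 dict entries stay as they are.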

I would do something like (untested):


import sys

tags = {
    # ... other tags ...
    'foo': ['foo1', 'foo2', 'foo3'],
    # ...
}                                       # tag dict
TAB = '\t'

def newlyTaggedWord(line):
    # separate the two parts of the line, dropping the trailing newline
    (word, tag) = line.rstrip('\n').split(TAB)
    new_tags = tags[tag]                # look the tag up in the dict
    tagging = TAB.join(new_tags)        # join sub-tags with TABs
    return word + TAB + tagging         # formatted result

def replaceTagging(source_name, target_name):
    source_file = open(source_name, 'r')
    target_file = open(target_name, 'w')
    # replacement loop: iterate over the file line by line
    for line in source_file:
        new_line = newlyTaggedWord(line) + '\n'
        target_file.write(new_line)
    source_file.close()
    target_file.close()

if __name__ == "__main__":
    source_name = sys.argv[1]
    target_name = sys.argv[2]
    replaceTagging(source_name, target_name)
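
One thing the function above does not cover is a line whose tag is missing 
from the dict, or a line with no tab at all (an empty line, say). Your 
original script left such text unchanged, so here is a possible variant that 
does the same. The name retag_line is made up, and passing unknown lines 
through untouched is my assumption about what you want.

```python
TAB = '\t'
tags = {'foo': ['foo1', 'foo2', 'foo3']}    # same toy dict as above

def retag_line(line, tags=tags):
    """Hypothetical variant: expand known tags, pass other lines through."""
    line = line.rstrip('\n')
    word, sep, tag = line.partition(TAB)    # split at the first TAB only
    tag = tag.strip()                       # tolerate stray spaces around the tag
    if not sep or tag not in tags:
        return line                         # no TAB or unknown tag: unchanged
    return TAB.join([word.strip()] + tags[tag])

print(repr(retag_line("word1\tfoo\n")))
print(repr(retag_line("word2\tbar\n")))
```

Whether silently passing unknown tags through is better than failing loudly 
depends on how much you trust the input; with 150 tags, raising an error may 
help you spot typos in the dict.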



> Below is my previous code:
> 
> 
> #!usr/bin/python
> 
> import re, sys
> f = file(sys.argv[1])
> readed= f.read()
> 
> def replace_words(text, word_dic):
>     for k, v in word_dic.iteritems():
>         text = text.replace(k, v)
>     return text
> 
> # the dictionary has target_word:replacement_word pairs
> 
> word_dic = {
> 'abbrev': 'abbrev    null    null',
> 'adj': 'adj    null    null',
> 'adv': 'adv    null    null',
> 'case_def_acc': 'case_def    acc    null',
> 'case_def_gen': 'case_def    gen    null',
> 'case_def_nom': 'case_def    nom    null',
> 'case_indef_acc': 'case_indef    acc    null',
> 'verb_part': 'verb_part    null    null'}
> 
> 
> # call the function and get the changed text
> 
> myString = replace_words(readed, word_dic)
> 
> 
> fout = open(sys.argv[2], "w")
> fout.write(myString)
> fout.close()
> 
> --dan


------
la vita e estrany
_______________________________________________
Tutor maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/tutor
