On Sun, 09 Jan 2011 16:49:35 +0000, Tom Anderson wrote:
>
> Any thoughts on what i should do? Do i have to bite the bullet and apply
> some cleverness in my pattern generation to avoid situations like this?
>
This sort of works:

    import re

    f = open("test.txt")
    p = re.compile("(spam*)*")
    for line in f:
        print("input line: %s" % (line.strip()))
        for m in p.findall(line):
            # findall also reports empty matches; skip them
            if m != "":
                print("==> %s" % (m))

when I feed it

=======================test.txt===========================
a line with no match
spa should match
spam should match
so should all of spaspamspammspammm
and so should all of spa spam spamm spammm
no match again.
=======================test.txt===========================

it produces:

input line: a line with no match
input line: spa should match
==> spa
input line: spam should match
==> spam
input line: so should all of spaspamspammspammm
==> spammm
input line: and so should all of spa spam spamm spammm
==> spa
==> spam
==> spamm
==> spammm
input line: no match again.

so obviously there's a problem with greedy matching where there are no
separators between adjacent matching strings. I tried non-greedy
matching, e.g. r'(spam*?)*', but this was worse, so I'll be interested to
see how the real regex mavens do it.

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
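
A note on the adjacent-match case: when a group is repeated, as in
(spam*)*, Python's re module keeps only the last thing the group
captured, which would explain why only "spammm" is reported for
"spaspamspammspammm". The following is a minimal sketch, assuming the
goal is just to list every spa/spam/spamm/... occurrence: it drops the
outer repetition and lets findall() scan for successive non-overlapping
matches. The helper name find_spams is hypothetical and not part of the
original script.

```python
import re

# Single-occurrence pattern: "spa" followed by zero or more "m".
# There is no outer (...)* wrapper, so each occurrence is its own match
# and nothing is lost to the "last capture wins" rule for repeated groups.
SPAM = re.compile(r"spam*")


def find_spams(line):
    """Return every non-overlapping match of SPAM in line, left to right."""
    return SPAM.findall(line)


if __name__ == "__main__":
    samples = (
        "so should all of spaspamspammspammm",
        "and so should all of spa spam spamm spammm",
        "no match again.",
    )
    for text in samples:
        print("input line: %s" % text)
        for m in find_spams(text):
            print("==> %s" % m)
```

Because the pattern can no longer match the empty string, the
if m != "" filter is not needed in this variant. Whether this fits the
generated patterns from the original question is an open assumption.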