A 0-width positive lookahead is probably what you want here: >>> s = """ ... hdhd <a href="http://mysite.com/blah.html">Test <i>String</i> OK</ a> ... ... """ >>> p = r'href="(http://mysite.com/[^"]+)">(.*)(?=</a>)' >>> m = re.search(p, s) >>> m.group(1) 'http://mysite.com/blah.html' >>> m.group(2) 'Test <i>String</i> OK'
The (?=...) bit is the lookahead, and won't consume any of the string you are searching. I've binned the named groups for clarity. The beautiful soup answers are a better bet though - they've already done the hard work, and after all, you are trying to roll your own partial HTML parser here, which will struggle with badly formed html... -- http://mail.python.org/mailman/listinfo/python-list