gervaz wrote: > Hi all, I need to find all the address in a html source page, I'm > using: > 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</ > b>)?</a>' > but the [^</a>]+ pattern retrieve all the strings not containing < > or / or a etc, although I just not want the word "</a>". How can I > specify: 'do not search the string "blabla"?'
You should consider using BeautifulSoup or lxml2's error-tolerant parser to work with HTML-documents. Sooner or later your regex-based processing is bound to fail, as documents get more complicated. Better to use the right tool for the job. The code should look like this (untested): from BeautifulSoup import BeautifulSoup html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>""" res = [] soup = BeautifulSoup(html) for tag in soup.findAll("a"): if tag["href"].startswith("http://mysite.com"): res.append(tag["href"]) Not so hard, and *much* more robust. Diez -- http://mail.python.org/mailman/listinfo/python-list