Re: Regexp

Diez B. Roggisch Mon, 19 Jan 2009 06:55:38 -0800

gervaz wrote:

> Hi all, I need to find all the address in a html source page, I'm
> using:
> 'href="(?P<url>http://mysite.com/[^"]+)">(<b>)?(?P<name>[^</a>]+)(</
> b>)?</a>'
> but the [^</a>]+ pattern retrieve all the strings not containing <
> or / or a etc, although I just not want the word "</a>". How can I
> specify: 'do not search the string "blabla"?'


You should consider using BeautifulSoup or lxml's error-tolerant HTML parser
to work with HTML documents.

Sooner or later your regex-based processing is bound to fail, as documents
get more complicated. Better to use the right tool for the job.
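
For reference, the literal question ("match anything except the string
</a>") can be expressed with a negative lookahead: [^</a>] is a character
class that excludes the single characters <, /, a and >, whereas
(?:(?!</a>).)+ consumes characters only while "</a>" does not start at the
current position. A minimal sketch, assuming Python's re module; the function
name find_links and the sample input in the usage note are made up for
illustration, and the pattern still breaks on single-quoted attributes,
extra attributes, nested tags and the like:

```python
import re

# Pattern adapted from the one in the question. Only the "name" group changes:
# a negative lookahead replaces the [^</a>] character class.
LINK_RE = re.compile(
    r'href="(?P<url>http://mysite\.com/[^"]+)">'
    r'(?:<b>)?'
    r'(?P<name>(?:(?!</a>).)+?)'  # any run of characters not containing "</a>"
    r'(?:</b>)?</a>'
)


def find_links(html):
    """Return (url, name) pairs for every anchor the pattern matches."""
    return [(m.group("url"), m.group("name")) for m in LINK_RE.finditer(html)]
```

Calling find_links('<a href="http://mysite.com/foo"><b>a link</b></a>')
should give back the URL together with the name "a link", including the
letter "a" that the original character class would have rejected.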

With BeautifulSoup, the code should look like this (untested):

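# BeautifulSoup 3 import; with BeautifulSoup 4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

html = """<html><a href="http://mysite.com/foobar/baz">link</a></html>"""

res = []
soup = BeautifulSoup(html)
for tag in soup.findAll("a"):
    # tag.get() returns None for anchors without an href instead of raising KeyError
    href = tag.get("href")
    if href and href.startswith("http://mysite.com"):
        res.append(href)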


Not so hard, and *much* more robust.
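
If installing a third-party package is not an option, the same idea can be
sketched with the standard library's html.parser module. This is only an
illustration under a few assumptions: it targets Python 3, the names
LinkCollector and collect_links are invented here, and html.parser is less
forgiving of badly broken markup than BeautifulSoup or lxml:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href (or with an empty one).
        if href and href.startswith(self.prefix):
            self.links.append(href)


def collect_links(html, prefix="http://mysite.com"):
    """Return all matching href values found in the given HTML string."""
    parser = LinkCollector(prefix)
    parser.feed(html)
    parser.close()
    return parser.links
```

collect_links(html) on the sample document above should return a list
holding the single URL http://mysite.com/foobar/baz.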

Diez
--
http://mail.python.org/mailman/listinfo/python-list
