Re: beautifulsoup .vs tidy

Ravi Teja Sat, 01 Jul 2006 15:55:46 -0700

Paul Boddie wrote:
> Ravi Teja wrote:
> >
> > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> > web pages in general.
>
> import libxml2dom
> import urllib
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue


I wasn't aware that your module handles HTML as well.

> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.

No reason that you don't know already.

http://www.boddie.org.uk/python/HTML.html

"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."

XML parsers are not required to be forgiving in order to be regarded as
compliant, and much of the HTML out there is not well-formed.
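
To make the point concrete, here is a rough sketch. It is not from the
thread and does not use libxml2dom: it assumes Python 3 and only the
standard library, and the markup string and helper names are made up for
illustration. A strict XML parser rejects a bit of tag soup, while a
lenient HTML parser still gets the link text out of it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unquoted attribute, unclosed <li> and <br>.
MALFORMED = "<ul><li><a href=/moin/Community>Community</a><br><li>Other</ul>"


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class LinkTextCollector(HTMLParser):
    """Lenient HTML parser that collects the text found inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.link_texts.append(data)


def link_texts(text):
    """Feed possibly malformed HTML to the lenient parser; return link texts."""
    collector = LinkTextCollector()
    collector.feed(text)
    collector.close()
    return collector.link_texts


if __name__ == "__main__":
    print(parses_as_xml(MALFORMED))
    print(link_texts(MALFORMED))
```

This only shows the strict-versus-lenient difference. It says nothing about
XPath itself, which works fine once a forgiving parser has built the tree.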

-- 
http://mail.python.org/mailman/listinfo/python-list
