Paul Boddie wrote:
> Ravi Teja wrote:
> >
> > 1.) XPath is not a good idea at all with "malformed" HTML or perhaps
> > web pages in general.
>
> import libxml2dom
> import urllib
> f = urllib.urlopen("http://wiki.python.org/moin/")
> s = f.read()
> f.close()
> # s contains HTML not XML text
> d = libxml2dom.parseString(s, html=1)
> # get the community-related links
> for label in d.xpath("//li[.//a/text() = 'Community']//li//a/text()"):
>     print label.nodeValue
I wasn't aware that your module handles HTML as well.
> Of course, lxml should be able to do this kind of thing as well. I'd be
> interested to know why this "is not a good idea", though.
No reason that you don't know already.
http://www.boddie.org.uk/python/HTML.html
"If the document text is well-formed XML, we could omit the html
parameter or set it to have a false value."
XML parsers are not required to be forgiving in order to be regarded as
compliant, and much of the HTML out there is not well-formed.
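
To illustrate the difference with nothing but the standard library, here
is a minimal sketch (Python 3). The sample markup and the helper names
parses_as_xml, LinkTextCollector and link_labels are made up for this
example, and it says nothing about how libxml2dom works internally. It
only shows a strict XML parser rejecting markup that a lenient HTML
parser accepts.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: unclosed <li>, <a> and <br> elements. Browsers cope
# with this, but it is not well-formed XML.
TAG_SOUP = "<ul><li><a href='/a'>Community<li><a href='/b'>Events</a><br></ul>"


def parses_as_xml(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class LinkTextCollector(HTMLParser):
    """Collect the text found inside <a> elements, tolerating bad markup."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Keep only non-blank text that appears while inside a link.
        if self.in_link and data.strip():
            self.labels.append(data.strip())


def link_labels(text):
    """Feed text to the lenient HTML parser and return the link labels."""
    collector = LinkTextCollector()
    collector.feed(text)
    collector.close()
    return collector.labels


if __name__ == "__main__":
    print(parses_as_xml(TAG_SOUP))
    print(link_labels(TAG_SOUP))
```

The XML parser stops at the first well-formedness error, while the HTML
parser carries on and still yields the link text.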