[EMAIL PROTECTED] wrote:
> I understand that the web is full of ill-formed XHTML web pages but this is
> Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and my standard Python XML parsing tools don't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any hope
> avoiding regular expression hacks to extract the data from this page?
>
Here's something that may help.

---<code>
#!/usr/bin/python -tt
"""get html page via url, and filter through _elementtidy

Hmmm, the elementtidy.so might contain other useful interfaces;
see tidyenum.h
or, maybe check
    tidy -help
or
    tidy -show-config

ref: Lundh's http://effbot.org/zone/element-tidylib.htm
"""
# Python 2 script: uses the print statement and urllib.urlopen
__version__ = "1.0"

import urllib, _elementtidy
import sys

# note: we expect that site.py has been (re-)configured to honor locale
# rather than default to ascii (ugh!) as in the 2.4 distro
sysenc = sys.getdefaultencoding()

prog = sys.argv[0]
USAGE = """\
Usage: %s <url>
    retrieves an html document from url, and 'cleans it up' via
    elementtidy, a library equivalent of the common tidy program
    using tidy-equivalent options
        output in xhtml
        write numeric character entities
        uses character encoding matching the current locale
""" % prog

def usage():
    print USAGE
    sys.exit(1)

def htmltidy(htmldata, enc=sysenc, errout=sys.stderr):
    """perform magic tidy fixup to html source giving valid xhtml"""
    # yuk, damn inconsistencies! map Python codec names to the ones tidy expects
    if enc == "iso-8859-1":
        enc = "latin1"
    elif enc.lower() == 'utf':
        enc = 'utf8'
    else:
        enc = enc.replace("-", "")
    xml, err = _elementtidy.fixup(htmldata, enc)
    if err:
        # expect multiple lines: html problems (that were fixed)
        if errout:
            print >> errout, "ERROR: %s" % err
    return xml

def get_tidy(page):
    """page needs to be a url spec for html content -- 'file:' ok"""
    html = urllib.urlopen(page).read()
    return htmltidy(html, errout=None)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        usage()
    xml = get_tidy(sys.argv[1])
    print xml
#===eof===
---</code>

Regards,
..jim

-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
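
Once the page has been tidied into valid XHTML, the ordinary XML tools should be able to read it, so the regular expression hacks may not be needed. Below is a minimal sketch of that second step, not something from the script above. It uses the standard library's xml.etree.ElementTree on a current Python rather than the 2.4-era setup, and it assumes the tidied output carries the usual XHTML namespace. The SAMPLE markup, its labels and numbers, and the table_rows helper are all made up for illustration; the real page's layout will differ.

```python
import xml.etree.ElementTree as ET

# Namespace that XHTML documents declare on the <html> element (assumed here).
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for tidied output; placeholder labels and values only.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample report</title></head>
<body>
<table>
<tr><td>Label A</td><td><b>1.23</b></td></tr>
<tr><td>Label B</td><td>4.56</td></tr>
</table>
</body>
</html>"""


def table_rows(xhtml):
    """Return the text of every table row as a list of cell strings."""
    root = ET.fromstring(xhtml)
    rows = []
    # Element names must be namespace-qualified: {namespace}tag
    for tr in root.iter("{%s}tr" % XHTML_NS):
        cells = [
            # itertext() also picks up text inside nested tags such as <b>
            "".join(td.itertext()).strip()
            for td in tr.findall("{%s}td" % XHTML_NS)
        ]
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    for row in table_rows(SAMPLE):
        print(row)
```

To use it on a real page, the string returned by get_tidy() would take the place of SAMPLE, and the loop would need adjusting to whichever tables actually hold the data.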
