[EMAIL PROTECTED] wrote:
> I understand that the web is full of ill-formed XHTML web pages but this is
> Microsoft:
> 
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
> 
> I can't validate it and my standard Python XML parsing tools don't work on it.
> 
> If this was just some teenager's web site I'd move on.  Is there any hope of
> avoiding regular expression hacks to extract the data from this page?
>

Here's something that may help.

---<code>
#!/usr/bin/python -tt
"""get html page via url, and filter through _elementtidy

Hmmm, the elementtidy.so might contain other useful interfaces
 see tidyenum.h
or, maybe check
  tidy -help
or
  tidy -show-config

ref: Lundh's http://effbot.org/zone/element-tidylib.htm
"""

__version__ = "1.0"

import urllib, _elementtidy
import sys
# note: we expect that site.py has been (re-)configured to honor locale
# rather than default to ascii (ugh!) as in the 2.4 distro
sysenc = sys.getdefaultencoding()

prog = sys.argv[0]
USAGE="""\
Usage: %s <url>

retrieves an html document from url, and 'cleans it up'
via elementtidy, a library equivalent of the common tidy program
using tidy-equivalent options
  output in xhtml
  write numeric character entities
uses character encoding matching the current locale
""" % prog

def usage():
    print USAGE
    sys.exit(1)

def htmltidy(htmldata, enc=sysenc, errout=sys.stderr):
    """perform magic tidy fixup to html source giving valid xhtml"""

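    # yuk, damn inconsistencies! tidy spells its encoding names differently
    # from Python (no hyphens, lower case), so map one onto the other
    enc = enc.lower()
    if enc == "iso-8859-1":
        enc = "latin1"
    elif enc == 'utf':
        enc = 'utf8'
    else:
        enc = enc.replace("-", "")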

    xml, err = _elementtidy.fixup(htmldata, enc)
    if err:
        # expect multiple lines: html problems (that were fixed)
        if errout:
            print >> errout, "ERROR: %s" % err
    return xml

def get_tidy(page):
    """page needs to be a url spec for html content -- 'file:' ok"""
    html = urllib.urlopen(page).read()
    return htmltidy(html,errout=None)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        usage()

    xml = get_tidy(sys.argv[1]) 
    print xml

#===eof===
---</code>
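
Once the page has been through tidy, the result should be well-formed XHTML, so the usual ElementTree tools ought to work on it without any regex hacks. Here is a minimal sketch of that second step. It assumes the tidied output carries the standard XHTML namespace (which tidy normally adds in xhtml output mode), and it uses a made-up SAMPLE string in place of what get_tidy() returns. The table layout, the cell values, and the table_rows() helper are all hypothetical, not taken from the MSN page.

```python
import xml.etree.ElementTree as ET

# Namespace tidy normally puts on the root element in xhtml output mode
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the string returned by get_tidy();
# the real page's markup and values will differ.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>sample</title></head>
<body>
<table>
<tr><td>field one</td><td>value one</td></tr>
<tr><td>field two</td><td>value two</td></tr>
</table>
</body>
</html>"""

def table_rows(xml_text):
    """Return every <tr> in the document as a list of its <td> texts."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iter("{%s}tr" % XHTML_NS):
        # td.text is only the text directly inside the cell; text inside
        # nested tags (<b>, <a>, ...) would need itertext() instead
        cells = [(td.text or "").strip()
                 for td in tr.findall("{%s}td" % XHTML_NS)]
        rows.append(cells)
    return rows

if __name__ == "__main__":
    for row in table_rows(SAMPLE):
        print(row)
```

On a real page you would pass the output of get_tidy(url) to table_rows() and then pick out the rows you care about by their label cell.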

Regards,
..jim


-- 
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list
