On Jan 7, 1:09 am, John Nagle <[EMAIL PROTECTED]> wrote:
> Another in our ongoing series on "Parsing Real-World HTML".
>
> It's wrong, of course. But Firefox will accept as HTML escapes
>
> &amp
> &gt
> &lt
>
> as well as the correct forms
>
> &amp;
> &gt;
> &lt;
>
> To be "compatible", a Python screen scraper at
>
> http://zesty.ca/python/scrape.py
>
> has a function "htmldecode", which is supposed to recognize
> HTML escapes and generate Unicode. (Why isn't this a standard
> Python library function? Its inverse is available.)
>
> This uses the regular expression
>
> charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?',re.UNICODE)
>
> to recognize HTML escapes.
>
> Note the ";?", which makes the closing ";" optional.
>
> This seems fine until we hit something valid but unusual like
>
> http://www.example.com?foo=1??
>
> for which "htmldecode" tries to convert "1234567" into
> a Unicode character with that decimal number, and gets a
> Unicode overflow.
>
> For our own purposes, I rewrote "htmldecode" to require a
> sequence ending in ";", which means some bogus HTML escapes won't
> be recognized, but correct HTML will be processed correctly.
> What's general opinion of this behavior? Too strict, or OK?
>
> John Nagle
> SiteTruth
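For what it's worth, the strict behaviour described in the quoted message could look roughly like the sketch below. This is only an illustration, assuming Python 3; it is not the code from scrape.py or the quoted rewrite. The names `htmldecode_strict`, `STRICT_CHARREF` and `MAX_CODEPOINT` are made up for the example, and the range check on numeric references is my own addition to avoid the overflow.

```python
import re
from html.entities import name2codepoint

# Same shape as the quoted pattern, but the trailing ";" is mandatory.
# re.ASCII (an assumption of this sketch) keeps \d and \w to ASCII characters.
STRICT_CHARREF = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);', re.ASCII)

MAX_CODEPOINT = 0x10FFFF  # highest valid Unicode code point


def htmldecode_strict(text):
    """Replace well-formed HTML escapes; leave everything else untouched."""

    def replace(match):
        body = match.group(1)
        if body.startswith('#'):
            digits = match.group(2)
            if digits[0] == 'x':
                codepoint = int(digits[1:], 16)
            else:
                codepoint = int(digits)
            # Out-of-range or surrogate references are left as written
            # instead of raising an overflow error.
            if codepoint > MAX_CODEPOINT or 0xD800 <= codepoint <= 0xDFFF:
                return match.group(0)
            return chr(codepoint)
        codepoint = name2codepoint.get(body)
        if codepoint is None:
            return match.group(0)  # unknown entity name, keep the text
        return chr(codepoint)

    return STRICT_CHARREF.sub(replace, text)


print(htmldecode_strict('1 &lt; 2 &amp;&amp; 3 &gt; 2'))
print(htmldecode_strict('&amp &gt &lt'))
```

With this approach the unterminated forms `&amp`, `&gt` and `&lt` pass through unchanged, which is the trade-off the quoted message asks about. Python 3.4 and later also ship `html.unescape` in the standard library, which may be worth comparing against.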
Maybe htmltidy could help: http://tidy.sourceforge.net/ ?