[issue20288] HTMLParse handing of non-numeric charrefs broken

Anders Hammarquist Fri, 17 Jan 2014 06:11:04 -0800

New submission from Anders Hammarquist:

Python 2.7 HTMLParser.py lines 185-199 (similar lines still exist in Python 3.4):
                match = charref.match(rawdata, i)
                if match:
                    ...
                else:
                    if ";" in rawdata[i:]: #bail by consuming &#
                        self.handle_data(rawdata[0:2])
                        i = self.updatepos(i, 2)
                    break


If you feed the parser a broken (non-numeric) charref, it passes whatever two
characters happen to be at the start of rawdata to handle_data(). E.g.:

import sys
from HTMLParser import HTMLParser  # Python 2.7 module name

p = HTMLParser()
p.handle_data = lambda x: sys.stdout.write(x)
p.feed('<p>&#foo;</p>')

will print '<p', which is clearly wrong. I think the intention of the code is to
pass '&#' (that is, rawdata[i:i+2] rather than rawdata[0:2]), which seems saner.
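
To make the difference concrete, here is a minimal standalone sketch of the
bail-out branch. It does not use HTMLParser itself: bail_data() is a
hypothetical helper, and the charref pattern is my approximation of the one in
the module, so treat it as an illustration of the slicing mistake rather than
as the actual patch.

```python
import re

# Approximation of the numeric charref pattern (an assumption, not copied
# from the stdlib source).
charref = re.compile(r'&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')


def bail_data(rawdata, i, fixed):
    """Return what would be handed to handle_data() for the '&#' at index i.

    Hypothetical helper mirroring only the else-branch quoted above.
    Returns None when the charref is valid or when no ';' follows.
    """
    if charref.match(rawdata, i):
        return None  # valid numeric charref, handled elsewhere
    if ';' in rawdata[i:]:
        if fixed:
            return rawdata[i:i + 2]  # slice relative to the current position
        return rawdata[0:2]          # current behaviour: start of the buffer
    return None


rawdata = '<p>&#foo;</p>'
i = rawdata.index('&#')
print(bail_data(rawdata, i, fixed=False))  # <p
print(bail_data(rawdata, i, fixed=True))   # &#
```

If this reading is right, the updatepos(i, 2) call on the following line
probably wants i + 2 as its second argument as well, but I have not verified
that.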

----------
components: Library (Lib)
messages: 208336
nosy: iko
priority: normal
severity: normal
status: open
title: HTMLParse handing of non-numeric charrefs broken
type: behavior

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue20288>
_______________________________________