[issue670664] HTMLParser.py - more robust SCRIPT tag parsing

Ezio Melotti Tue, 26 Jul 2011 23:52:23 -0700

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

I left a review of your patch on Rietveld, including a description of what I 
think is going on there (the patch lacks some context and it's not easy to 
figure out how everything works).
I also did some tests with and without the patch:


>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247


# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'

So my question is: is it normal that the data is passed to handle_data in 
chunks?
AFAIU HTML parsers should see CDATA as a single chunk of bytes they don't care 
about, so the fact that further parsing happens on the content of script/style 
seems wrong to me.
If I'm reading the code correctly, that's because the "interesting" regex is set 
to look for a closing tag ('</'), maybe on the assumption that the CDATA section 
doesn't contain any other tags (usually true for <style>, often false for 
<script>).
Changing the regex to look explicitly for the matching closing tag (e.g. 
'</script>') might be better, but it would still fail for e.g. <script> 
document.write('<script>alert("foo")</script>')</script> (though some browsers 
fail on this too).
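
To illustrate the last point, here is a minimal standalone sketch of a regex that looks only for the matching closing tag. It is not the code in the patch or in HTMLParser.py; the names cdata_end_pattern and split_cdata are hypothetical and exist only for this example.

```python
import re


def cdata_end_pattern(tag):
    # Hypothetical helper: match only the closing tag of the element whose
    # content is being treated as CDATA (e.g. </script>), not every '</'.
    return re.compile(r'</%s\s*>' % re.escape(tag), re.IGNORECASE)


def split_cdata(rawdata, tag):
    # Return (content, rest): everything before the first matching closing
    # tag, and everything after it. rest is None if no closing tag is found.
    match = cdata_end_pattern(tag).search(rawdata)
    if match is None:
        return rawdata, None
    return rawdata[:match.start()], rawdata[match.end():]


# The content that is currently split into several chunks stays in one piece:
content, rest = split_cdata(
    "<p>foo</p> '</scr'+'ipt>' <span>bar</span></script>", 'script')

# The document.write() case still fails, because the first '</script>' that
# appears inside the JavaScript string is taken as the end of the element:
broken, leftover = split_cdata(
    """document.write('<script>alert("foo")</script>')</script>""", 'script')
```

With this approach the first call yields the whole script body as a single string, while the second call stops early and leaves "')</script>" behind as ordinary markup.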

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue670664>
_______________________________________