[issue32876] HTMLParser raises exception on some inputs

Ezio Melotti Fri, 14 Sep 2018 00:30:02 -0700


Ezio Melotti <[email protected]> added the comment:


There are at least a couple of issues here.

The first one is the way the parser handles '<![...'.  The linked page contains 
markup like '<![STAT]-[USER-ACTIVE]!>' and since the parser currently checks 
for '<![' only, _markupbase.py:parse_marked_section gets called and an error 
gets incorrectly raised.
However, "8.2.4.42. Markup declaration open state"[0] states that after
consuming '<!', there are only 4 valid paths forward:
1) if we have '<!--', it's a comment;
2) if we have '<!doctype', it's a doctype declaration;
3) if we have '<![CDATA[', it's a CDATA section;
4) if it's something else, it's a bogus comment.

The above example should therefore fall into 4), and be treated like a bogus 
comment.
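
As a rough illustration of the four paths listed above (this is not the code
from the PR; the function name classify_markup_declaration is made up for this
sketch, and the spec's additional restrictions on where CDATA sections are
allowed are ignored), the decision could look like this:

```python
def classify_markup_declaration(rawdata, i=0):
    """Classify the construct starting with '<!' at index i.

    Hypothetical helper that mirrors the four paths of the
    "markup declaration open state" in simplified form.
    """
    if rawdata[i:i + 2] != '<!':
        raise ValueError("expected '<!' at index %d" % i)
    rest = rawdata[i + 2:]
    if rest.startswith('--'):
        return 'comment'            # path 1: '<!--'
    if rest[:7].lower() == 'doctype':
        return 'doctype'            # path 2: case-insensitive match
    if rest.startswith('[CDATA['):
        return 'cdata'              # path 3: case-sensitive match
    return 'bogus comment'          # path 4: everything else


print(classify_markup_declaration('<![STAT]-[USER-ACTIVE]!>'))
```

With this check, the markup from the linked page is no longer mistaken for a
marked section just because it starts with '<!['.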

PR-9295 changes parse_html_declaration() to align with the specs and implement
path 3) as specified, so that only '<![CDATA[' starts a CDATA section.  As a
result the webpage is parsed without errors (the invalid markup falls into
path 4) and is considered a bogus comment).


The second issue is about an EOF in the middle of a bogus markup declaration,
like in the minimal example provided by the OP ("<![\n").  In this case the
comment should still be emitted ('[\n'), but currently nothing gets emitted.
I'll look more into it either tomorrow or later this month and update the PR
accordingly (or perhaps I'll open a separate issue).
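
To make the expected EOF behavior concrete, here is a minimal sketch of how a
bogus comment could be scanned.  It is an assumption-laden illustration, not
the parser's actual code: scan_bogus_comment is a hypothetical name, and the
sketch only covers "consume up to '>' or EOF, then emit the comment".

```python
def scan_bogus_comment(rawdata, i):
    """Return (comment_text, next_index) for a bogus comment.

    Hypothetical helper: `i` is the index of the '<!' that opens the
    declaration.  The comment runs up to the next '>' or, if there is
    none, up to the end of the input (EOF).
    """
    start = i + 2                       # skip the leading '<!'
    end = rawdata.find('>', start)
    if end == -1:
        # EOF inside the declaration: still emit what we have so far.
        return rawdata[start:], len(rawdata)
    return rawdata[start:end], end + 1  # skip the closing '>'


print(scan_bogus_comment('<![\n', 0))
```

For the OP's example the returned comment text is '[\n', which is what I
would expect the parser to emit.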


[0]: https://www.w3.org/TR/html52/syntax.html#tokenizer-markup-declaration-open-state

----------
versions: +Python 2.7, Python 3.7, Python 3.8

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue32876>
_______________________________________