[issue13633] Handling of hex character references in HTMLParser.handle_charref

Ezio Melotti Wed, 22 Feb 2012 18:38:56 -0800

Ezio Melotti <ezio.melo...@gmail.com> added the comment:

This behavior is now documented, but the situation could still be improved.  
Adding a new method that receives the converted entity seems like a good way 
to handle this.  The parser can call both the old and the new method, and 
users can override whichever they prefer.


One problem with the current methods (handle_charref and handle_entityref) is 
that they don't do any processing on the entity and let invalid character 
references like &#x1000000000; or &#iamnotanentity; go through.
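To make the pass-through concrete, here is a small sketch against the 
html.parser module.  It assumes a Python version where HTMLParser accepts 
convert_charrefs (3.4+) and turns it off so the old handlers are still 
called.  RawRefCollector, collect_refs and naive_convert are hypothetical 
names used only for illustration.

```python
from html.parser import HTMLParser


class RawRefCollector(HTMLParser):
    """Hypothetical helper: records exactly what the old handlers receive."""

    def __init__(self):
        # convert_charrefs=False keeps handle_charref/handle_entityref in use.
        super().__init__(convert_charrefs=False)
        self.seen = []

    def handle_charref(self, name):
        # name arrives as raw text, e.g. "x41"; no range check has been done.
        self.seen.append(("charref", name))

    def handle_entityref(self, name):
        # name arrives as raw text, e.g. "amp"; it is not looked up anywhere.
        self.seen.append(("entityref", name))


def collect_refs(markup):
    """Parse markup and return the (kind, raw name) pairs the handlers saw."""
    parser = RawRefCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


def naive_convert(name):
    """The conversion a handler has to do itself; None if it fails."""
    try:
        if name[:1] in ("x", "X"):
            return chr(int(name[1:], 16))
        return chr(int(name))
    except (ValueError, OverflowError):
        # Out-of-range code points such as 0x1000000000 end up here.
        return None
```

Calling collect_refs("<p>&#x1000000000;</p>") hands the handler the string 
"x1000000000" unchanged, and it is up to the handler to notice that 
naive_convert cannot turn it into a character.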

There are at least three changes that should be made in order to follow the 
HTML5 standard [0]:
 1) the parser should look at html.entities while parsing named character 
references (see also #11113).  This will allow the parser to parse &notit; as 
"¬it;" and &notin; as "∉" (see note at the very end of [0]);
 2) invalid character references (e.g. &#x1000000000;, &#iamnotanentity;) 
should not go through;
 3) the table at [0] with the replacement characters should be used by the 
parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).
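For reference, the three rules can be observed with html.unescape, which 
was added to the standard library later (Python 3.4) and follows the HTML5 
rules for character references.  This is only an illustration of the 
expected results, not the parser change itself; the samples list and the 
converted dict are names made up for this example.

```python
import html

# Inputs taken from the points above.
samples = ["&notit;", "&notin;", "&#x80;", "&#x1000000000;"]

# Map each raw reference to the text html.unescape produces for it.
converted = {raw: html.unescape(raw) for raw in samples}

for raw, text in converted.items():
    # ascii() keeps the output readable on any terminal encoding.
    print(raw, "->", ascii(text))
```

With these rules, &notit; becomes U+00AC followed by "it;", &notin; becomes 
U+2209, &#x80; becomes U+20AC, and the out-of-range &#x1000000000; becomes 
U+FFFD (the replacement character).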

Now, 1) can be done for both the old and the new methods, but for 2) and 3) 
the situation is a bit more complicated.  The best thing is probably to keep 
sending invalid references unchanged to the old methods, and implement the 
correct behavior for the new method only.
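A minimal sketch of that split, written as a subclass rather than as a 
patch to the parser.  The method name handle_converted_ref is hypothetical 
(no name has been settled on), html.unescape (Python 3.4+) stands in for the 
HTML5 conversion, and the sketch assumes every reference was terminated 
with ";" when it rebuilds the raw text.

```python
import html
from html.parser import HTMLParser


class DualRefParser(HTMLParser):
    """Sketch: old handlers stay raw, a new hook gets converted text."""

    def __init__(self):
        # convert_charrefs=False so the old handlers are still called.
        super().__init__(convert_charrefs=False)
        self.raw = []        # what the old methods see, unchanged
        self.converted = []  # what the hypothetical new method sees

    def handle_converted_ref(self, text):
        # Hypothetical new method: receives the already converted text.
        self.converted.append(text)

    def handle_charref(self, name):
        source = "&#" + name + ";"  # assumes a trailing ";" in the input
        self.raw.append(source)     # old behavior: no validation at all
        self.handle_converted_ref(html.unescape(source))

    def handle_entityref(self, name):
        source = "&" + name + ";"   # assumes a trailing ";" in the input
        self.raw.append(source)
        self.handle_converted_ref(html.unescape(source))


parser = DualRefParser()
parser.feed("<p>&#x80; &notin; &#x1000000000;</p>")
parser.close()
```

After this run, parser.raw still holds the three references exactly as 
written, while parser.converted holds U+20AC, U+2209 and U+FFFD, so existing 
subclasses keep their behavior and only the new hook sees corrected text.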

[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references

----------
dependencies: +html.entities mapping dicts need updating?

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13633>
_______________________________________