Ezio Melotti <ezio.melo...@gmail.com> added the comment:

This behavior is now documented, but the situation could still be improved. Adding a new method that receives the converted entity seems like a good way to handle this. The parser can call both the old methods and the new one, and users can override whichever they prefer.
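
To make the idea concrete, here is a rough sketch of how such a method could be emulated in a subclass. The name handle_entity is hypothetical and is not part of HTMLParser. The sketch also assumes a Python version that provides html.unescape() and the convert_charrefs argument (3.4 or later), and it borrows html.unescape() for the conversion rather than changing the parser itself.

```python
from html import unescape
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Sketch: keep the old callbacks and add a hypothetical handle_entity()."""

    def __init__(self):
        # convert_charrefs=False makes the parser report references through
        # handle_charref() and handle_entityref() instead of handle_data().
        super().__init__(convert_charrefs=False)

    def handle_charref(self, name):
        # Old method: receives the raw name, e.g. "x80" for "&#x80;".
        self.handle_entity(unescape("&#" + name + ";"))

    def handle_entityref(self, name):
        # Old method: receives the raw name, e.g. "notin" for "&notin;".
        self.handle_entity(unescape("&" + name + ";"))

    def handle_entity(self, text):
        # Hypothetical new method: receives the already converted text.
        pass


class Collector(EntityAwareParser):
    """Example user code that only overrides the new method."""

    def __init__(self):
        super().__init__()
        self.entities = []

    def handle_entity(self, text):
        self.entities.append(text)


parser = Collector()
parser.feed("a &#x80; b &notin; c &notit; d &#0; e")
parser.close()
print(parser.entities)
```

Because html.unescape() follows the HTML5 rules, the reference for 0x80 arrives in handle_entity() as U+20AC. A subclass that still wants the raw names can keep overriding handle_charref() and handle_entityref() instead.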
One problem with the current methods (handle_charref and handle_entityref) is that they don't do any processing on the entity and let invalid character references like � or &#iamnotanentity; go through.

There are at least 3 changes that should be made in order to follow the HTML5 standard [0]:
1) the parser should look at html.entities while parsing named character references (see also #11113). This will allow the parser to parse &notit; as "¬it;" and &notin; as "∉" (see the note at the very end of [0]);
2) invalid character references (e.g. �, &#iamnotanentity;) should not go through;
3) the table at [0] with the replacement characters should be used by the parser to "correct" those invalid character references (e.g. 0x80 -> U+20AC).

Now, 1) can be done for both the old and new methods, but for 2) and 3) the situation is a bit more complicated. The best thing is probably to keep sending invalid references unchanged to the old methods, and implement the correct behavior for the new method only.

[0]: http://www.w3.org/TR/html5/tokenization.html#tokenizing-character-references

----------
dependencies: +html.entities mapping dicts need updating?

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue13633>
_______________________________________