Simon Cross <[EMAIL PROTECTED]> added the comment:

I've tracked down the cause to the .unescape(...) method in HTMLParser. The replaceEntities function passed to re.sub() always returns a unicode character, even when the matched string s is a byte string. Changing line 383 to:
    return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately this is obviously not a viable solution in the general case. The problem is that there is no way to know which character set to encode to without both knowing the HTTP headers (which are not available to HTMLParser) and looking at the XML and HTML headers.

Python 3.0 implicitly rejects non-unicode strings right at the start of html.parser.HTMLParser.feed(...) by adding '' to the data passed in. Given Python 3.0's behaviour, perhaps the docs should be updated to say that HTMLParser does not support non-unicode strings? If it should support byte strings, we'll have to figure out how to handle encoded entity issues.

It's a bit weird that character and entity references outside tags/attributes result in calls to .entityref(...) and .charref(...), while those inside get unescape called automatically. I don't really see what can be done about that, though.

----------
versions: +Python 2.7

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue3932>
_______________________________________
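As an illustration of the encoding problem raised in the comment above, here is a minimal Python 3 sketch of entity replacement that accepts byte strings only when the caller supplies the character set. The helper name replace_entities and its encoding parameter are hypothetical and are not part of HTMLParser; the sketch only shows why the parser cannot choose an encoding on its own.

```python
import re
from html.entities import name2codepoint

# Hypothetical helper, not part of HTMLParser. It illustrates that a
# byte-string input can only be unescaped if the caller names the encoding.
_ENTITY_RE = re.compile(r"&(\w+);")


def replace_entities(s, encoding=None):
    """Replace named entities in s; bytes input requires an explicit encoding."""

    def _replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        # Leave unknown entities untouched.
        return match.group(0)

    if isinstance(s, bytes):
        if encoding is None:
            raise TypeError("an encoding is required to unescape byte strings")
        # Decode, substitute on text, then re-encode in the same character set.
        return _ENTITY_RE.sub(_replace, s.decode(encoding)).encode(encoding)
    return _ENTITY_RE.sub(_replace, s)
```

With this sketch, replace_entities(b"caf&eacute;", "latin-1") gives b"caf\xe9", while the same input with "utf-8" gives b"caf\xc3\xa9". The same entity maps to different bytes depending on the character set, which is why hard-coding .encode("utf-8") is not viable in the general case.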