Sam Ruby wrote:
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).
I think (but am not sure) you are referring to patch #1462498 here,
which fixes bugs 1452246 and 1087808.

>  * it will unescape &amp;
>  * it won't unescape &copy;

That must be because you have amp in your entitydefs, but not copy.

>  * it will unescape &#38;
>  * it won't unescape &#x26;

That's because it doesn't recognize hex character references.
That's systematic, though: it doesn't just ignore them in attribute
values, but also in content.

>  * it will unescape &#146;
>  * it won't unescape &#8217;

That's because the value is larger than 256, so chr() fails.

> There are a number of issues here. While not unescaping anything is
> suboptimal, at least the recipient is aware of exactly which characters
> have been unescaped (i.e., none of them). The proposed solution makes
> it impossible for the recipient to know which characters are unescaped,
> and which are original. (Note: feeds often contain such abominations as
> &amp;copy; which the new code will treat indistinguishably from &copy;)

The recipient should then add copy to entitydefs; sgmllib will
unescape &copy;, so the recipient can know not to unescape that.
Alternatively, the recipient could provide an empty entitydefs.

> Additionally, there is a unicode issue here - one that is shared by
> handle_charref, but at least that method is overrideable. If unescaping
> remains, do it for hex character references and for values greater than
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.

Alternatively, a callback function could be provided for character
references. Unfortunately, the existing callback is unsuitable, as it is
supposed to do the full processing; this callback should return the
replacement text. Generally assuming Unicode would be wrong, though.

Would you like to contribute a patch?
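To make the callback idea concrete, here is a rough sketch of what "return the replacement text" could look like. It is not sgmllib code and not the patch under discussion: the names unescape and default_charref, and the convention of returning None to mean "leave the reference alone", are assumptions made up for illustration. It is written for Python 3, where chr() covers all of Unicode; in Python 2, chr() only accepts values below 256, which is where unichr came into the discussion.

```python
import re

# Matches &name;, decimal &#123; and hexadecimal &#x7B; references.
_REF = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")


def default_charref(codepoint):
    """Hypothetical callback: return replacement text, or None to leave
    the reference untouched. This default only converts ASCII."""
    if codepoint < 128:
        return chr(codepoint)
    return None


def unescape(text, entitydefs, charref=default_charref):
    """Unescape references in text (hypothetical helper, not sgmllib API).

    Named entities are replaced only if they appear in entitydefs;
    character references are passed to the charref callback.
    """
    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        if name is not None:
            # Unknown entity names are left exactly as written.
            return entitydefs.get(name, match.group(0))
        if hex_digits is not None:
            codepoint = int(hex_digits, 16)
        else:
            codepoint = int(dec_digits)
        replacement = charref(codepoint)
        return match.group(0) if replacement is None else replacement

    # Single pass: text produced by a replacement is not scanned again.
    return _REF.sub(replace, text)


if __name__ == "__main__":
    print(unescape("&amp; &copy;", {"amp": "&"}))   # copy is not in entitydefs
    print(unescape("&#38; &#x26;", {}))             # decimal and hex both handled
    print(unescape("&#8217;", {}, charref=chr))     # caller opts in to Unicode
    print(unescape("&amp;copy;", {}))               # empty entitydefs: untouched
```

With an empty entitydefs nothing named is unescaped, so the recipient knows every entity it sees is original. With amp defined, &amp;copy; comes out as &copy;, which is exactly the ambiguity Sam describes unless copy is defined as well.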
Regards,
Martin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev