Fred L. Drake, Jr. wrote:
> On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> > Planet is a feed aggregator written in Python. It depends heavily on
> > SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> > and I've submitted a test case and a patch[1] (use or discard the patch,
> > it is the test that I care about).
>
> And it's a nice aggregator to use, indeed!
>
> > While looking around, a few things surfaced. For starters, it would
> > seem that the version of sgmllib in SVN HEAD will selectively unescape
> > certain character references that might appear in an attribute. I say
> > selectively, as:
> >
> > * it will unescape &amp;
> > * it won't unescape &copy;
> > * it will unescape &#38;
> > * it won't unescape &#x26;
> > * it will unescape &#146;
> > * it won't unescape &#8217;
>
> And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is
> defined in terms of SGML. The sgmllib documentation also notes that it isn't
> really a fully general SGML parser (it isn't), but that it exists primarily
> as a foundation for htmllib.

The feed itself is read first with SAX (with a fallback to sgmllib if
the feed is not well formed, but that's beside the point). The embedded
HTML portions are then processed with subclasses of sgmllib.

> > There are a number of issues here. While not unescaping anything is
> > suboptimal, at least the recipient is aware of exactly which characters
> > have been unescaped (i.e., none of them). The proposed solution makes
> > it impossible for the recipient to know which characters are unescaped,
> > and which are original. (Note: feeds often contain such abominations as
> > &copy; which the new code will treat indistinguishably from ©)
>
> My suspicion is that the "right" thing to do at the sgmllib level is to
> categorize the markup and call a method depending on what the entity
> reference is, and let that handle whatever it is. For SGML, that means we
> have things like &name; (entity references), &#123; (character references),
> and that's it. &#x123; isn't legal SGML under any circumstance;
> the "&#x<number>;" syntax was introduced with XML.

... but it effectively is valid HTML. And as you point out below,
sgmllib's raison d’être is to support htmllib.

> > Additionally, there is a unicode issue here - one that is shared by
> > handle_charref, but at least that method is overridable. If unescaping
> > remains, do it for hex character references and for values greater than
> > 8 bits, i.e., use unichr instead of chr if the value is greater than 127.
>
> For SGML, it's worse than that, since the document character set is defined
> in the SGML declaration, which is a far hairier beast than an XML
> declaration. :-)

Understood.

> It really sounds like sgmllib is the wrong foundation for this. While the
> module has some questionable behaviors, none of them are significant in
> its intended context (support for htmllib).
> Now, I understand that RSS has historical issues, with HTML-as-practiced
> getting embedded as payload data with various flavors of escaping applied,
> and I'm not an expert in the details of that. Have you looked at
> HTMLParser as an alternate to sgmllib? It has better support for XHTML
> constructs.

HTMLParser is less forgiving, and generally less suitable for consuming
HTML as practiced.

- Sam Ruby
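
To make the "all or nothing" point concrete, here is a minimal sketch of
what uniform unescaping of an attribute value could look like. It is not
sgmllib's code and not the submitted patch. It is written for Python 3,
where sgmllib no longer exists and chr() already covers the full Unicode
range (the role unichr played in Python 2). The helper name unescape_attr
and the regular expression are assumptions made for illustration only.

```python
import re
from html.entities import name2codepoint

# Matches the three reference forms discussed in the thread:
# &#xHEX; (hex), &#DEC; (decimal) and &name; (named entity).
_REF = re.compile(r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));")


def unescape_attr(value):
    """Hypothetical helper: resolve every reference form the same way."""

    def replace(match):
        hex_digits, dec_digits, name = match.groups()
        try:
            if hex_digits is not None:
                return chr(int(hex_digits, 16))
            if dec_digits is not None:
                return chr(int(dec_digits))
            return chr(name2codepoint[name])
        except (KeyError, ValueError, OverflowError):
            # Unknown name or out-of-range number: leave the text as-is
            # so the caller can still see the original reference.
            return match.group(0)

    return _REF.sub(replace, value)


if __name__ == "__main__":
    for raw in ("&amp;", "&copy;", "&#38;", "&#x26;", "&#8217;", "&bogus;"):
        print(repr(raw), "->", repr(unescape_attr(raw)))
```

With this approach named, decimal and hex references are either all
resolved or (for unknown names and out-of-range numbers) all left
untouched, so the recipient does not have to guess which characters were
unescaped. Note that it maps numbers straight to code points, so &#146;
becomes the C1 control character U+0092 rather than a curly quote.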