Fred L. Drake, Jr. wrote:
> On Monday 12 June 2006 00:05, Sam Ruby wrote:
> > Just to be clear: Planet uses Mark's feed parser, which uses SGMLlib.
>
> Cool.
>
> > I was investigating a bug in sgmllib which affected the feed parser (and
> > therefore Planet), and noticed that there were changes in the SVN head
> > of Python which broke three feed parser unit tests.
> >
> > It is my belief that these changes will break other existing users of
> > sgmllib.
>
> This is good to know; thanks for pointing it out.
>
> If you can summarize the specific changes to sgmllib that cause problems for
> the feed parser, and identify the tests there that rely on the old behavior,
> I'll be glad to look at the problems. I expect to have some time in the next
> few evenings, so I should be able to look at these soon.
>
> Is the SourceForge CVS the definitive development source for the feed parser?

Yes, but if you check out the CVS HEAD, you won't see any failures, as I've
committed changes that mitigate the problems I've found.

However, if you get the latest release instead, you will see that feeds that
contain &lt; &amp; or &gt; in attribute values will get these converted to
<, &, and > characters instead. In some cases this can cause problems,
particularly if the output is reparsed by sgmllib.

Additionally, numeric character references in the range of &#128; to &#255;
will cause the released Feed Parser to die with a UnicodeDecodeError.

My workarounds are to re-escape < and > characters, and to escape bare
ampersands - beyond that I can't really tell for sure which ampersands need
to be re-escaped, and which ones I should leave as is. I also first try
decoding attributes in the original declared encoding and then fall back to
iso-8859-1. If a single attribute value contains both non-ASCII utf-8
characters and a numeric character reference above &#128;, then this will
produce incorrect results.

I have also committed a workaround for the incorrect parsing of attributes
with quoted markup that I originally reported.

- Sam Ruby
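
For readers following along, the two workarounds described above (re-escaping
attribute values, and decoding with a fallback to iso-8859-1) could be
sketched roughly as follows. This is only an illustration and not the code
committed to the feed parser; the names BARE_AMP, reescape_attr and
decode_attr are hypothetical, and the sketch is written for current Python
rather than the 2006-era interpreter.

```python
import re

# An ampersand that does not begin something shaped like an entity or
# numeric character reference (hypothetical helper pattern).
BARE_AMP = re.compile(r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)")


def reescape_attr(value):
    """Re-escape <, > and bare ampersands in an attribute value."""
    # Ampersands go first so the &lt; / &gt; added next are left alone.
    value = BARE_AMP.sub("&amp;", value)
    return value.replace("<", "&lt;").replace(">", "&gt;")


def decode_attr(raw, declared_encoding):
    """Decode bytes in the declared encoding, else fall back to iso-8859-1."""
    try:
        return raw.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError):
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode("iso-8859-1")
```

An ampersand that happens to be followed by something shaped like an entity
is left untouched here, which is exactly the ambiguity mentioned above: once
the parser has unescaped the value, there is no reliable way to tell which
ampersands were originally escaped.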