[feedparser] Large number of bug fixes for Jakarta Feed Parser

Brad Neuberg Wed, 02 Mar 2005 16:46:27 -0800

Over the last few weeks I have been working on improving Rojo's feed subscription system, which uses the Jakarta Feed Parser. To do so I built an automated testing framework that scripted our UI and ran a large amount of data through it, fixing subscription bugs as I encountered them. I traced a number of these bugs into the Jakarta Feed Parser and fixed them there. I have put together a patch file with these fixes. These are the bugs I found and the changes I made to fix them:

* Xanga feeds were not working. Xanga does not support the autodiscovery standard and has nothing in its HTML that would let us use HTML link probing to find an RSS or Atom feed. Instead, we have to do aggressive probing at well-known locations. Unfortunately, some blogging services handle HTTP redirects incorrectly and also return 200 OK for files that don't exist, giving false positives. When the ProbeLocator is probing for RSS files at well-known locations, for some blogging services we have to pay attention to the target of an HTTP redirect, and for others we have to ignore redirects entirely. I've added a new method named followRedirects() to the base BlogService class; each blogging service now returns true or false to say whether to follow redirects returned while probing that service. It turns out that we must ignore redirects for every other service we currently deal with, but not for Xanga. The Xanga BlogService class returns true from followRedirects(), which makes it possible to work with these blogs now.

* The FeedLocator class does three different kinds of RSS discovery: autodiscovery probing, HTML source analysis, and aggressive probing. When each of these stages added discovered links to our list of found RSS feeds, it did not first check whether we had already found that link through a different discovery mechanism. This led to duplicate feeds in the list, which broke downstream systems that use the Jakarta Feed Parser. This has been fixed.

* Exact subscriptions to Craigs List feeds were not working, such as "http://www.craigslist.org/w4m/index.xml". A FeedReference can be either absolute, such as "http://rss.groups.yahoo.com/group/talkinaboutarchitecture/rss", or relative, such as "/atom.xml". In the ProbeLocator, we first discover the BlogService we are dealing with and then "ask" that blog service for its usual list of feed locations, which are returned as FeedReferences. I added a FeedReference.isRelative() method so that the ProbeLocator correctly builds the HTTP path for remote probing of a FeedReference based on whether it is relative or absolute.

* Yahoo Groups were not working. I modified the YahooGroups BlogService object to work correctly.

* When subscribing to some feeds, a NullPointerException was thrown in the EntityDecoder; this is fixed.

* AOL LiveJournal feeds were not working. I modified the AOLJournal BlogService object to work correctly.

* Some feed services, such as AOL LiveJournal, are case-sensitive when retrieving feeds. We were incorrectly lowercasing all feed URLs in DiscoveryLocator and BlogServiceDiscovery; we now keep the case that is discovered through autodiscovery.

* I rewrote parts of ResourceExpander; a large number of feeds weren't being subscribed due to bugs in how we were expanding URIs.
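
To make the probing changes above concrete, here is a minimal Java sketch. It is not the code from the patch. Only the names BlogService, FeedReference, followRedirects(), and isRelative() come from the changes described above; the class shapes, feedReferences(), probeUrl(), candidates(), and openProbe() are assumptions made up for illustration, and blog.example.com is a placeholder host.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins; the real Feed Parser classes will look different.
class ProbeSketch {

    /** A feed location that a blog service usually exposes. */
    static final class FeedReference {
        final String location;

        FeedReference(String location) { this.location = location; }

        /** Relative references carry no scheme, e.g. "/atom.xml". */
        boolean isRelative() {
            return !URI.create(location).isAbsolute();
        }
    }

    /** Each service says whether redirects seen while probing can be trusted. */
    interface BlogService {
        boolean followRedirects();
        List<FeedReference> feedReferences(); // assumed accessor name
    }

    /** Absolute references are probed as-is; relative ones are resolved against the blog URL. */
    static String probeUrl(String blogUrl, FeedReference ref) {
        if (!ref.isRelative()) {
            return ref.location;
        }
        return URI.create(blogUrl).resolve(ref.location).toString();
    }

    /** Build the probe list, skipping anything an earlier discovery stage already found. */
    static List<String> candidates(String blogUrl, BlogService service, Set<String> alreadyFound) {
        Set<String> seen = new HashSet<>(alreadyFound);
        List<String> result = new ArrayList<>();
        for (FeedReference ref : service.feedReferences()) {
            String url = probeUrl(blogUrl, ref);
            if (seen.add(url)) { // add() returns false for duplicates
                result.add(url);
            }
        }
        return result;
    }

    /** Prepare (but do not send) a probe request that honors the service's redirect policy. */
    static HttpURLConnection openProbe(String url, BlogService service) {
        try {
            HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setInstanceFollowRedirects(service.followRedirects());
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a Xanga-style service would return true from followRedirects() and every other service false, so the redirect decision stays with the service and out of the probing loop. Note that the URLs are compared and returned without any case folding.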

I also discovered a serious bug in our LinkLocator process that I wasn't able to fix. We scan the document looking for certain kinds of links to see if they are RSS links; however, we ignore an A HREF tag if it has an image inside. This is extremely dangerous, since most pages have the orange XML icon on their page, hyperlinked to their feed! If this is fixed, I suspect the LinkLocator will be able to find many more feeds in the future.
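
One possible direction for a fix, sketched here under assumptions and not taken from the LinkLocator code: judge each anchor by its href alone and never inspect what the anchor wraps, so an image-only link such as the orange XML icon still counts. The class name, the regex-based scan, and the file-extension heuristic in looksLikeFeed() are all hypothetical, and a regex is only an approximation of real HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual LinkLocator implementation.
class LinkScanSketch {

    // Matches <a ... href="...">body</a>, case-insensitively and across line breaks.
    private static final Pattern ANCHOR = Pattern.compile(
        "<a\\b[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"'][^>]*>(.*?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Assumed heuristic: treat a few common feed file extensions as feed links. */
    static boolean looksLikeFeed(String href) {
        String lower = href.toLowerCase(Locale.ROOT); // lowercased for comparison only
        return lower.endsWith(".xml") || lower.endsWith(".rss") || lower.endsWith(".rdf");
    }

    /** Return feed-like hrefs in document order, keeping their original case. */
    static List<String> feedLinks(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            String href = m.group(1);
            // The anchor body, m.group(2), is deliberately not inspected,
            // so links that wrap only an <img> are kept.
            if (looksLikeFeed(href) && !found.contains(href)) {
                found.add(href);
            }
        }
        return found;
    }
}
```

The href is lowercased only for the extension check and is returned in its original case, which matters for the case-sensitive services mentioned earlier.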

The patch is too big to place here. I have put it on my web server at http://codinginparadise.org/feedparser/feed_refactor_patch_01_02_2005.txt

Best,
Brad Neuberg, [EMAIL PROTECTED]
Senior Software Engineer, Rojo Networks
Weblog: http://www.codinginparadise.org

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
