John Leach
Thu, 26 Oct 2006 04:00:40 -0700
Hi, Ian Forrester invited me to discuss my News Sniffer project on here, so here goes.
I originally wrote 'Watch Your Mouth' in Ruby as a little toy. It uses the HYS RSS feeds to record all comments and spot when any disappear (taking into consideration the ones that slide off the bottom of the feed). All the data was in static YAML files and it wrote the results out to static HTML.

As the number of threads and comments I was monitoring grew, it became clear I needed something a bit more scalable. A few months back I rewrote it to use MySQL and the Ruby on Rails framework. Ruby on Rails allowed me to quickly develop the fancy front end, with the voting system etc. The use of MySQL obviously increased the speed of comment look-ups, but also enabled me to distribute the back-end work across multiple servers. This obviously gives me more bandwidth and CPU resources, but also helps evade firewalling :)

After rewriting 'Watch Your Mouth', I wrote the Revisionista feature using the same technology. It uses RSS feeds to identify articles to monitor, then HTML scraping to turn them into plain text.

With all the attention News Sniffer has been getting lately, I spent some time adding fragment caching to speed things up. HTML fragments are stored in a Memcached service so that the distributed back-end processes can expire them properly.

I'm currently wasting a lot of bandwidth due to your HYS RSS feeds having no useful caching headers. IIRC, the Last-Modified header always reports the current time, there are no ETags, and there aren't even any Content-Length headers! I believe this is the same for your news articles too (though it's been a few weeks since I last poked around tbh).

Due to time constraints right now, I'll talk a little about why I set it up later.

John.

http://johnleach.co.uk

On Tue, 2006-10-24 at 15:46 +0100, Ian Forrester wrote:
> Hi,
>
> Someone posted up your site on the backstage mailing list [1]. We're
> interested how your system works from a technological point of view.
> Are you using RSS and a mix of html scraping or is it totally RSS
> filtering? Also what kind of environment are you working in? Perl,
> PHP, Java, Ruby?
>
> I thought you might also want to talk in detail about why you set it
> up and how?. Our mailing list is public [2] and I think it might be a
> good place to talk about it.
>
> Looking forward to your reply,
>
> Ian Forrester || backstage.bbc.co.uk
>
> [1] - http://backstage.bbc.co.uk/
> [2] - http://www.mail-archive.com/backstage@lists.bbc.co.uk/

-
Sent via the backstage.bbc.co.uk discussion group. To unsubscribe, please visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/