Re: [backstage] News Sniffer - inner workings

2006-10-26 Thread Mr I Forrester

Thanks John,


I certainly found this useful, especially how you use a mix of RSS and 
HTML scraping to get it working (which I somewhat expected).



I've not played with the Have Your Say RSS feeds myself, but I or someone 
else will feed this information back to the HYS tech team.



Cheers,


Ian


John Leach wrote:


Hi,

Ian Forrester invited me to discuss my News Sniffer project on here, so
here goes.

I originally wrote 'Watch Your Mouth' in Ruby as a little toy.  It
uses the HYS RSS feeds to record all comments and spot when any
disappear (taking into consideration the ones that slide off the bottom
of the feed).  All the data was in static YAML files and it wrote the
results out to static HTML.
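
To illustrate the idea of telling removed comments apart from ones that
have merely slid off the bottom of the feed, here is a minimal sketch.
It is not John's code: the Comment struct, the missing_comments method
and the assumption that comment ids increase with posting order are all
hypothetical.

```ruby
require "set"

# Hypothetical record for one feed entry. `id` is assumed to increase
# with posting order, which the original post does not state.
Comment = Struct.new(:id, :text)

# Returns comments that were in the `previous` poll, are absent from the
# `current` poll, and cannot be explained by sliding off the bottom.
def missing_comments(previous, current)
  return [] if current.empty? # an empty poll tells us nothing

  current_ids = current.map(&:id).to_set
  oldest_visible = current_ids.min

  previous.select do |comment|
    # Older than anything still in the feed: it probably just slid off.
    next false if comment.id < oldest_visible

    !current_ids.include?(comment.id)
  end
end

previous = [5, 4, 3, 2].map { |id| Comment.new(id, "comment #{id}") }
current  = [7, 6, 5, 3].map { |id| Comment.new(id, "comment #{id}") }

removed = missing_comments(previous, current)
p removed.map(&:id)
```

In this made-up example comment 2 is older than everything still in the
feed, so it is treated as having slid off, while comment 4 sits inside
the visible window and is reported as missing.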

As the number of threads and comments I was monitoring grew, it became
clear I needed something a bit more scalable.

A few months back I rewrote it to use MySQL and the Ruby on Rails
framework.  Ruby on Rails allowed me to quickly develop the fancy front
end, with the voting system etc.  The use of MySQL obviously increased
the speed of comment look-ups, but also enabled me to distribute the
back-end work across multiple servers.  This obviously gives me more
bandwidth and CPU resources, but also helps evade firewalling :)

After rewriting 'Watch Your Mouth', I wrote the Revisionista feature
using the same technology.  It uses RSS feeds to identify articles to
monitor, then HTML scraping to turn each article into plain text.
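
A rough sketch of the scrape-and-compare step, under stated assumptions:
the plain_text method and the RevisionLog class are hypothetical names,
the tag stripping is deliberately crude (no entity decoding), and a real
scraper would more likely use a proper HTML parser.

```ruby
# Crude HTML-to-text: drop script/style blocks, strip the remaining
# tags, then collapse runs of whitespace.
def plain_text(html)
  text = html.gsub(%r{<(script|style)\b.*?</\1>}mi, " ")
  text = text.gsub(/<[^>]+>/, " ")
  text.gsub(/\s+/, " ").strip
end

# Keeps every distinct version of an article's text, keyed by URL.
class RevisionLog
  def initialize
    @versions = Hash.new { |hash, url| hash[url] = [] }
  end

  # Returns true when the text differs from the last stored version.
  def record(url, html)
    text = plain_text(html)
    return false if @versions[url].last == text

    @versions[url] << text
    true
  end

  def versions(url)
    @versions[url]
  end
end

log = RevisionLog.new
url = "http://example.org/story" # placeholder URL

first  = log.record(url, "<p>Ten people were hurt.</p>")
same   = log.record(url, "<p class=\"lead\">Ten people were hurt.</p>")
edited = log.record(url, "<p>Twelve people were hurt.</p>")
p [first, same, edited, log.versions(url).length]
```

Comparing the extracted text rather than the raw HTML means that markup
changes alone do not count as a new revision.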

With all the attention News Sniffer has been getting lately, I spent
some time adding fragment caching to speed things up.  HTML fragments
are stored in a Memcached service so that the distributed back-end
processes can expire them properly.
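
The pattern described here can be sketched as follows.  This is only an
illustration: FragmentStore is a hypothetical in-memory stand-in for the
Memcached service, and the key name is invented.  In the real setup the
store is a separate service, which is what lets a back-end process on
another machine expire a fragment that the front end wrote.

```ruby
# In-memory stand-in for a shared cache such as Memcached.
class FragmentStore
  def initialize
    @data = {}
  end

  # Return the cached fragment, or build it with the block and store it.
  def fetch(key)
    @data.fetch(key) { @data[key] = yield }
  end

  # Called by a back-end worker when the underlying data changes.
  def expire(key)
    @data.delete(key)
  end
end

store = FragmentStore.new
renders = 0
render = -> { renders += 1; "<li>comment list, render #{renders}</li>" }

store.fetch("thread/42/comments", &render) # rendered and stored
store.fetch("thread/42/comments", &render) # served from the cache
store.expire("thread/42/comments")         # a worker recorded a change
latest = store.fetch("thread/42/comments", &render) # rendered again
p [renders, latest]
```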

I'm currently wasting a lot of bandwidth due to your HYS RSS feeds
having no useful caching headers.  IIRC, the last modified header always
reports the current time, there are no etags, and there aren't even any
Content-length headers!  I believe this is the same for your news
articles too (though it's been a few weeks since I last poked around,
tbh).
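
For context, this is roughly what a polling client could do if the feeds
did send validators: remember the ETag and Last-Modified values and send
them back, so that an unchanged feed costs a bodiless 304 response
instead of a full download.  A minimal sketch using Ruby's standard
Net::HTTP; the CachedFeed struct, the method names and the URL are
hypothetical, and error handling is left out.

```ruby
require "net/http"
require "uri"

# Validators and body remembered from the previous successful fetch.
CachedFeed = Struct.new(:etag, :last_modified, :body)

# Build a GET that asks the server to answer 304 if nothing has changed.
def conditional_get(uri, cached)
  request = Net::HTTP::Get.new(uri)
  if cached
    request["If-None-Match"] = cached.etag if cached.etag
    request["If-Modified-Since"] = cached.last_modified if cached.last_modified
  end
  request
end

# Fetch the feed, reusing the cached copy on 304 Not Modified.
def fetch_feed(uri, cached = nil)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(conditional_get(uri, cached))
  end
  return cached if response.is_a?(Net::HTTPNotModified)

  CachedFeed.new(response["ETag"], response["Last-Modified"], response.body)
end

uri = URI("http://example.org/feed.rss") # placeholder URL
cached = CachedFeed.new('"abc123"', "Thu, 26 Oct 2006 10:00:00 GMT", "<rss/>")
request = conditional_get(uri, cached)
p request["If-None-Match"]
```

With a Last-Modified header that always reports the current time and no
ETag, neither validator is of any use and every poll transfers the whole
feed.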

Due to time constraints right now, I'll talk a little about why I set it
up later.

John.
http://johnleach.co.uk

On Tue, 2006-10-24 at 15:46 +0100, Ian Forrester wrote:
  

Hi,

Someone posted your site on the backstage mailing list [1]. We're
interested in how your system works from a technological point of view.
Are you using a mix of RSS and HTML scraping, or is it totally RSS
filtering? Also, what kind of environment are you working in? Perl,
PHP, Java, Ruby?

I thought you might also want to talk in detail about why you set it
up and how. Our mailing list is public [2] and I think it might be a
good place to talk about it.

Looking forward to your reply,

Ian Forrester || backstage.bbc.co.uk

[1] - http://backstage.bbc.co.uk/
[2] - http://www.mail-archive.com/backstage@lists.bbc.co.uk/




-
Sent via the backstage.bbc.co.uk discussion group.  To unsubscribe, please 
visit http://backstage.bbc.co.uk/archives/2005/01/mailing_list.html.  
Unofficial list archive: http://www.mail-archive.com/backstage@lists.bbc.co.uk/
  




RE: [backstage] News Sniffer - inner workings

2006-10-26 Thread Kevin Hinde
I'm currently wasting a lot of bandwidth due to your HYS RSS feeds
having no useful caching headers.  IIRC, the last modified header always
reports the current time, there are no etags, and there aren't even any
Content-length headers!  I believe this is the same for your news
articles too (though it's been a few weeks since I last poked around
tbh)

Yes, I'm afraid we only have content-length and Etag on our non-HYS RSS
feeds.

Kevin.
