I run http://10.am and do this on a largish scale.
For aggregating RSS feeds I use RSSLite [1] rather than XML::RSS. RSSLite
avoids using expat and is a little naughty in that it will parse XML that
would make expat barf (a lot of RSS feeds unfortunately contain bad XML).
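
To illustrate the lenient idea, here is a minimal sketch using only core
Perl. It is not RSSLite's actual code or API; the helper name
lenient_rss_items and the sample feed are made up for the example, and it
only assumes that items look roughly like <item>...</item> with a <title>
and a <link> inside.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RSSLite): pull title/link pairs out of
# RSS text with regexps, without requiring well-formed XML.
sub lenient_rss_items {
    my ($rss) = @_;
    my @items;
    while ($rss =~ m{<item\b[^>]*>(.*?)</item>}gis) {
        my $body = $1;    # copy before the inner matches overwrite $1
        my ($title) = $body =~ m{<title\b[^>]*>\s*(.*?)\s*</title>}is;
        my ($link)  = $body =~ m{<link\b[^>]*>\s*(.*?)\s*</link>}is;
        next unless defined $title && defined $link;
        push @items, { title => $title, link => $link };
    }
    return @items;
}

# Made-up feed with an unescaped ampersand, which is not well-formed XML.
my $feed = <<'RSS';
<rss version="0.91"><channel><title>Example</title>
<item><title>Perl & XML</title><link>http://example.com/1</link></item>
<item><title>Second story</title><link>http://example.com/2</link></item>
</channel></rss>
RSS

printf "%s -> %s\n", $_->{title}, $_->{link} for lenient_rss_items($feed);
```

The trade-off is the obvious one: this accepts broken feeds, but it also
ignores CDATA sections, entities and namespaces that a real parser would
handle.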
For actual scraping of sites I basically use meaty regexps or HTML::Parser.
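
As a rough sketch of the HTML::Parser route (not the code 10.am runs), the
following collects every link whose href matches a pattern. The helper name
scrape_headlines, the /story/ URL pattern and the sample page are all
assumptions for illustration; a real site needs its own pattern.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return { link, title } for each <a> whose href
# matches $href_re. The pattern is site-specific.
sub scrape_headlines {
    my ($html, $href_re) = @_;
    my (@headlines, $current);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Start tag: begin a headline when we enter a matching link.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            return unless $tag eq 'a' && defined $attr->{href};
            $current = { link => $attr->{href}, title => '' }
                if $attr->{href} =~ $href_re;
        }, 'tagname, attr' ],
        # Text: accumulate entity-decoded text while inside the link.
        text_h => [ sub {
            my ($text) = @_;
            $current->{title} .= $text if $current;
        }, 'dtext' ],
        # End tag: finish the headline when the link closes.
        end_h => [ sub {
            my ($tag) = @_;
            return unless $tag eq 'a' && $current;
            $current->{title} =~ s/^\s+|\s+$//g;
            push @headlines, $current;
            undef $current;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @headlines;
}

# Made-up page: mixed-case tags, nested markup and an entity.
my $page = <<'HTML';
<html><body>
<A HREF="/story/101"><b>Perl 5.6.1</b> released</A>
<a href="/about.html">About us</a>
<a href="/story/102">mod_perl tips &amp; tricks</a>
</body></html>
HTML

printf "%s -> %s\n", $_->{title}, $_->{link}
    for scrape_headlines($page, qr{^/story/});
```

Compared with a regexp, the parser copes with nested markup, mixed-case
tags and entities for free, at the cost of a little more code per site.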
10.am also supplies feeds [2] in RSS if you want to use them.
I hope to open-source 10.am in the near future, once I have sorted out
some contractual obligations.
mallum
[1] http://industrial-linux.org/RSSLite/
[2] http://10.am/docs/feeds.htm (eg http://10.am/Development/Perl-rss )
On Wed, Mar 07, 2001 at 04:36:56PM +0000, Dave Hodgkinson wrote:
>
> What's the best way to scrape a variety of news headlines from various
> sites? Sort of a moreover for the intranet...
>
>
> --
> Dave Hodgkinson, http://www.hodgkinson.org
> Editor-in-chief, The Highway Star http://www.deep-purple.com
> Apache, mod_perl, MySQL, Sybase hired gun for, well, hire
> -----------------------------------------------------------------
>