Instead of recrawling the web every few months, I'd like Nutch to
monitor RSS feeds for site updates. The way I'm currently thinking of
doing this is:
1. If a website advertises syndication (<link rel="alternate"
type="application/(atom|rss|rsd)">), grab the feed file its "href"
attribute points to. If the feed isn't already RDF, I'll fetch it with
ROME and have ROME convert it to RDF (rough sketch after the list).
2. Compare it against the hash stored for that URI; if it matches,
skip on to the next site.
3. If the hash differs, fetch all the URLs in the feed and add them to
the segments to be indexed, using WebDBGenerator.main() with the
-dmozfile parameter. I still need to determine whether the DMOZ RDF
format is a strict superset of the feeds' RDF format or incompatible
with it. The validators all choke on the DMOZ file because of its
size; ROME and feedparser both run out of memory on it.
4. Optimise the index once every 24 hours.
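Concretely, steps 1-3 would look something like this (an untested
sketch; the com.sun.syndication classes are ROME's, loadStoredHash()
is just a stand-in for wherever the per-feed digest ends up living,
probably the WebDB, and the <link> auto-discovery is left out):

import java.net.URL;
import java.security.MessageDigest;
import java.util.Iterator;

import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.SyndFeedOutput;
import com.sun.syndication.io.XmlReader;

public class FeedChecker {

    // Stand-in: the digest stored for this feed URI on the last pass,
    // or null if we've never seen it. Would really live in the WebDB.
    static byte[] loadStoredHash(String feedUrl) { return null; }

    public static void main(String[] args) throws Exception {
        String feedUrl = args[0]; // already discovered via <link rel="alternate">

        // Step 1: fetch and parse the feed; ROME handles RSS and Atom alike.
        SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(feedUrl)));

        // Normalise to RSS 1.0 ("rss_1.0" is ROME's name for the RDF flavour)
        // so every feed hashes the same way regardless of its original format.
        feed.setFeedType("rss_1.0");
        String rdf = new SyndFeedOutput().outputString(feed);

        // Step 2: compare against the hash from the previous pass.
        byte[] hash = MessageDigest.getInstance("MD5").digest(rdf.getBytes("UTF-8"));
        byte[] stored = loadStoredHash(feedUrl);
        if (stored != null && MessageDigest.isEqual(stored, hash)) {
            System.out.println("unchanged, skipping " + feedUrl);
            return;
        }

        // Step 3: feed changed -- collect the entry links to hand off
        // for fetching/indexing.
        for (Iterator i = feed.getEntries().iterator(); i.hasNext();) {
            SyndEntry entry = (SyndEntry) i.next();
            System.out.println(entry.getLink());
        }
    }
}

One wrinkle with this approach: hashing ROME's normalised RDF output
rather than the raw feed bytes means every feed flavour hashes
consistently, but any cosmetic change in the generated output would
still trigger a refetch.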
Is this the best way to do what I'd like? Thanks in advance for the help!
--
Cheers,
Hasan Diwan <[EMAIL PROTECTED]>